diff --git a/econ_em_draws.json b/econ_em_draws.json index 2ae5bb1..dc64112 100644 --- a/econ_em_draws.json +++ b/econ_em_draws.json @@ -1 +1 @@ -{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. 
In simulation studies, our joint testing procedures outperform the\nBonferroni correction. We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. 
We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2310.04576": {"title": "Finite Sample Performance of a Conduct Parameter Test in Homogenous Goods Markets", "link": "http://arxiv.org/abs/2310.04576", "description": "We assess the finite sample performance of the conduct parameter test in\nhomogeneous goods markets. Statistical power rises with an increase in the\nnumber of markets, a larger conduct parameter, and a stronger demand rotation\ninstrument. However, even with a moderate number of markets and five firms,\nregardless of instrument strength and the utilization of optimal instruments,\nrejecting the null hypothesis of perfect competition remains challenging. Our\nfindings indicate that empirical results that fail to reject perfect\ncompetition are a consequence of the limited number of markets rather than\nmethodological deficiencies."}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.05311": {"title": "Identification and Estimation in a Class of Potential Outcomes Models", "link": "http://arxiv.org/abs/2310.05311", "description": "This paper develops a class of potential outcomes models characterized by\nthree main features: (i) Unobserved heterogeneity can be represented by a\nvector of potential outcomes and a type describing the manner in which an\ninstrument determines the choice of treatment; (ii) The availability of an\ninstrumental variable that is conditionally independent of unobserved\nheterogeneity; and (iii) The imposition of convex restrictions on the\ndistribution of unobserved heterogeneity. 
The proposed class of models\nencompasses multiple classical and novel research designs, yet possesses a\ncommon structure that permits a unifying analysis of identification and\nestimation. In particular, we establish that these models share a common\nnecessary and sufficient condition for identifying certain causal parameters.\nOur identification results are constructive in that they yield estimating\nmoment conditions for the parameters of interest. Focusing on a leading special\ncase of our framework, we further show how these estimating moment conditions\nmay be modified to be doubly robust. The corresponding double robust estimators\nare shown to be asymptotically normally distributed, bootstrap based inference\nis shown to be asymptotically valid, and the semi-parametric efficiency bound\nis derived for those parameters that are root-n estimable. We illustrate the\nusefulness of our results for developing, identifying, and estimating causal\nmodels through an empirical evaluation of the role of mental health as a\nmediating variable in the Moving To Opportunity experiment."}, "http://arxiv.org/abs/2310.05761": {"title": "Robust Minimum Distance Inference in Structural Models", "link": "http://arxiv.org/abs/2310.05761", "description": "This paper proposes minimum distance inference for a structural parameter of\ninterest, which is robust to the lack of identification of other structural\nnuisance parameters. Some choices of the weighting matrix lead to asymptotic\nchi-squared distributions with degrees of freedom that can be consistently\nestimated from the data, even under partial identification. In any case,\nknowledge of the level of under-identification is not required. We study the\npower of our robust test. Several examples show the wide applicability of the\nprocedure and a Monte Carlo investigates its finite sample performance. Our\nidentification-robust inference method can be applied to make inferences on\nboth calibrated (fixed) parameters and any other structural parameter of\ninterest. We illustrate the method's usefulness by applying it to a structural\nmodel on the non-neutrality of monetary policy, as in \\cite{nakamura2018high},\nwhere we empirically evaluate the validity of the calibrated parameters and we\ncarry out robust inference on the slope of the Phillips curve and the\ninformation effect."}, "http://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "http://arxiv.org/abs/2302.13066", "description": "Different proxy variables used in fiscal policy SVARs lead to contradicting\nconclusions regarding the size of fiscal multipliers. In this paper, we show\nthat the conflicting results are due to violations of the exogeneity\nassumptions, i.e. the commonly used proxies are endogenously related to the\nstructural shocks. We propose a novel approach to include proxy variables into\na Bayesian non-Gaussian SVAR, tailored to accommodate potentially endogenous\nproxy variables. Using our model, we show that increasing government spending\nis a more effective tool to stimulate the economy than reducing taxes. 
We\nconstruct new exogenous proxies that can be used in the traditional proxy VAR\napproach resulting in similar estimates compared to our proposed hybrid SVAR\nmodel."}, "http://arxiv.org/abs/2303.01863": {"title": "Constructing High Frequency Economic Indicators by Imputation", "link": "http://arxiv.org/abs/2303.01863", "description": "Monthly and weekly economic indicators are often taken to be the largest\ncommon factor estimated from high and low frequency data, either separately or\njointly. To incorporate mixed frequency information without directly modeling\nthem, we target a low frequency diffusion index that is already available, and\ntreat high frequency values as missing. We impute these values using multiple\nfactors estimated from the high frequency data. In the empirical examples\nconsidered, static matrix completion that does not account for serial\ncorrelation in the idiosyncratic errors yields imprecise estimates of the\nmissing values irrespective of how the factors are estimated. Single equation\nand systems-based dynamic procedures that account for serial correlation yield\nimputed values that are closer to the observed low frequency ones. This is the\ncase in the counterfactual exercise that imputes the monthly values of consumer\nsentiment series before 1978 when the data was released only on a quarterly\nbasis. This is also the case for a weekly version of the CFNAI index of\neconomic activity that is imputed using seasonally unadjusted data. The imputed\nseries reveals episodes of increased variability of weekly economic information\nthat are masked by the monthly data, notably around the 2014-15 collapse in oil\nprices."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2009.01995": {"title": "Instrument Validity for Heterogeneous Causal Effects", "link": "http://arxiv.org/abs/2009.01995", "description": "This paper provides a general framework for testing instrument validity in\nheterogeneous causal effect models. The generalization includes the cases where\nthe treatment can be multivalued ordered or unordered. Based on a series of\ntestable implications, we propose a nonparametric test which is proved to be\nasymptotically size controlled and consistent. Compared to the tests in the\nliterature, our test can be applied in more general settings and may achieve\npower improvement. Refutation of instrument validity by the test helps detect\ninvalid instruments that may yield implausible results on causal effects.\nEvidence that the test performs well on finite samples is provided via\nsimulations. 
We revisit the empirical study on return to schooling to\ndemonstrate application of the proposed test in practice. An extended\ncontinuous mapping theorem and an extended delta method, which may be of\nindependent interest, are provided to establish the asymptotic distribution of\nthe test statistic under null."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "http://arxiv.org/abs/2205.02274", "description": "Marketplace companies rely heavily on experimentation when making changes to\nthe design or operation of their platforms. The workhorse of experimentation is\nthe randomized controlled trial (RCT), or A/B test, in which users are randomly\nassigned to treatment or control groups. However, marketplace interference\ncauses the Stable Unit Treatment Value Assumption (SUTVA) to be violated,\nleading to bias in the standard RCT metric. In this work, we propose techniques\nfor platforms to run standard RCTs and still obtain meaningful estimates\ndespite the presence of marketplace interference. We specifically consider a\ngeneralized matching setting, in which the platform explicitly matches supply\nwith demand via a linear programming algorithm. Our first proposal is for the\nplatform to estimate the value of global treatment and global control via\noptimization. We prove that this approach is unbiased in the fluid limit. Our\nsecond proposal is to compare the average shadow price of the treatment and\ncontrol groups rather than the total value accrued by each group. We prove that\nthis technique corresponds to the correct first-order approximation (in a\nTaylor series sense) of the value function of interest even in a finite-size\nsystem. We then use this result to prove that, under reasonable assumptions,\nour estimator is less biased than the RCT estimator. At the heart of our result\nis the idea that it is relatively easy to model interference in matching-driven\nmarketplaces since, in such markets, the platform intermediates the spillover."}, "http://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "http://arxiv.org/abs/2208.09638", "description": "What is the purpose of pre-analysis plans, and how should they be designed?\nWe propose a principal-agent model where a decision-maker relies on selective\nbut truthful reports by an analyst. The analyst has data access, and\nnon-aligned objectives. 
In this model, the implementation of statistical\ndecision rules (tests, estimators) requires an incentive-compatible mechanism.\nWe first characterize which decision rules can be implemented. We then\ncharacterize optimal statistical decision rules subject to implementability. We\nshow that implementation requires pre-analysis plans. Focussing specifically on\nhypothesis tests, we show that optimal rejection rules pre-register a valid\ntest for the case when all data is reported, and make worst-case assumptions\nabout unreported data. Optimal tests can be found as a solution to a\nlinear-programming problem."}, "http://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "http://arxiv.org/abs/2302.11505", "description": "This paper studies settings where the analyst is interested in identifying\nand estimating the average causal effect of a binary treatment on an outcome.\nWe consider a setup in which the outcome realization does not get immediately\nrealized after the treatment assignment, a feature that is ubiquitous in\nempirical settings. The period between the treatment and the realization of the\noutcome allows other observed actions to occur and affect the outcome. In this\ncontext, we study several regression-based estimands routinely used in\nempirical work to capture the average treatment effect and shed light on\ninterpreting them in terms of ceteris paribus effects, indirect causal effects,\nand selection terms. We obtain three main and related takeaways. First, the\nthree most popular estimands do not generally satisfy what we call \\emph{strong\nsign preservation}, in the sense that these estimands may be negative even when\nthe treatment positively affects the outcome conditional on any possible\ncombination of other actions. Second, the most popular regression that includes\nthe other actions as controls satisfies strong sign preservation \\emph{if and\nonly if} these actions are mutually exclusive binary variables. Finally, we\nshow that a linear regression that fully stratifies the other actions leads to\nestimands that satisfy strong sign preservation."}, "http://arxiv.org/abs/2302.13455": {"title": "Nickell Bias in Panel Local Projection: Financial Crises Are Worse Than You Think", "link": "http://arxiv.org/abs/2302.13455", "description": "Local Projection is widely used for impulse response estimation, with the\nFixed Effect (FE) estimator being the default for panel data. This paper\nhighlights the presence of Nickell bias for all regressors in the FE estimator,\neven if lagged dependent variables are absent in the regression. This bias is\nthe consequence of the inherent panel predictive specification. We recommend\nusing the split-panel jackknife estimator to eliminate the asymptotic bias and\nrestore the standard statistical inference. Revisiting three macro-finance\nstudies on the linkage between financial crises and economic contraction, we\nfind that the FE estimator substantially underestimates the post-crisis\neconomic losses."}} \ No newline at end of file +{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. 
Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. In simulation studies, our joint testing procedures outperform the\nBonferroni correction. 
We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. 
We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2310.04576": {"title": "Finite Sample Performance of a Conduct Parameter Test in Homogenous Goods Markets", "link": "http://arxiv.org/abs/2310.04576", "description": "We assess the finite sample performance of the conduct parameter test in\nhomogeneous goods markets. Statistical power rises with an increase in the\nnumber of markets, a larger conduct parameter, and a stronger demand rotation\ninstrument. However, even with a moderate number of markets and five firms,\nregardless of instrument strength and the utilization of optimal instruments,\nrejecting the null hypothesis of perfect competition remains challenging. Our\nfindings indicate that empirical results that fail to reject perfect\ncompetition are a consequence of the limited number of markets rather than\nmethodological deficiencies."}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.05311": {"title": "Identification and Estimation in a Class of Potential Outcomes Models", "link": "http://arxiv.org/abs/2310.05311", "description": "This paper develops a class of potential outcomes models characterized by\nthree main features: (i) Unobserved heterogeneity can be represented by a\nvector of potential outcomes and a type describing the manner in which an\ninstrument determines the choice of treatment; (ii) The availability of an\ninstrumental variable that is conditionally independent of unobserved\nheterogeneity; and (iii) The imposition of convex restrictions on the\ndistribution of unobserved heterogeneity. 
The proposed class of models\nencompasses multiple classical and novel research designs, yet possesses a\ncommon structure that permits a unifying analysis of identification and\nestimation. In particular, we establish that these models share a common\nnecessary and sufficient condition for identifying certain causal parameters.\nOur identification results are constructive in that they yield estimating\nmoment conditions for the parameters of interest. Focusing on a leading special\ncase of our framework, we further show how these estimating moment conditions\nmay be modified to be doubly robust. The corresponding double robust estimators\nare shown to be asymptotically normally distributed, bootstrap based inference\nis shown to be asymptotically valid, and the semi-parametric efficiency bound\nis derived for those parameters that are root-n estimable. We illustrate the\nusefulness of our results for developing, identifying, and estimating causal\nmodels through an empirical evaluation of the role of mental health as a\nmediating variable in the Moving To Opportunity experiment."}, "http://arxiv.org/abs/2310.05761": {"title": "Robust Minimum Distance Inference in Structural Models", "link": "http://arxiv.org/abs/2310.05761", "description": "This paper proposes minimum distance inference for a structural parameter of\ninterest, which is robust to the lack of identification of other structural\nnuisance parameters. Some choices of the weighting matrix lead to asymptotic\nchi-squared distributions with degrees of freedom that can be consistently\nestimated from the data, even under partial identification. In any case,\nknowledge of the level of under-identification is not required. We study the\npower of our robust test. Several examples show the wide applicability of the\nprocedure and a Monte Carlo investigates its finite sample performance. Our\nidentification-robust inference method can be applied to make inferences on\nboth calibrated (fixed) parameters and any other structural parameter of\ninterest. We illustrate the method's usefulness by applying it to a structural\nmodel on the non-neutrality of monetary policy, as in \\cite{nakamura2018high},\nwhere we empirically evaluate the validity of the calibrated parameters and we\ncarry out robust inference on the slope of the Phillips curve and the\ninformation effect."}, "http://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "http://arxiv.org/abs/2302.13066", "description": "Different proxy variables used in fiscal policy SVARs lead to contradicting\nconclusions regarding the size of fiscal multipliers. In this paper, we show\nthat the conflicting results are due to violations of the exogeneity\nassumptions, i.e. the commonly used proxies are endogenously related to the\nstructural shocks. We propose a novel approach to include proxy variables into\na Bayesian non-Gaussian SVAR, tailored to accommodate potentially endogenous\nproxy variables. Using our model, we show that increasing government spending\nis a more effective tool to stimulate the economy than reducing taxes. 
We\nconstruct new exogenous proxies that can be used in the traditional proxy VAR\napproach resulting in similar estimates compared to our proposed hybrid SVAR\nmodel."}, "http://arxiv.org/abs/2303.01863": {"title": "Constructing High Frequency Economic Indicators by Imputation", "link": "http://arxiv.org/abs/2303.01863", "description": "Monthly and weekly economic indicators are often taken to be the largest\ncommon factor estimated from high and low frequency data, either separately or\njointly. To incorporate mixed frequency information without directly modeling\nthem, we target a low frequency diffusion index that is already available, and\ntreat high frequency values as missing. We impute these values using multiple\nfactors estimated from the high frequency data. In the empirical examples\nconsidered, static matrix completion that does not account for serial\ncorrelation in the idiosyncratic errors yields imprecise estimates of the\nmissing values irrespective of how the factors are estimated. Single equation\nand systems-based dynamic procedures that account for serial correlation yield\nimputed values that are closer to the observed low frequency ones. This is the\ncase in the counterfactual exercise that imputes the monthly values of consumer\nsentiment series before 1978 when the data was released only on a quarterly\nbasis. This is also the case for a weekly version of the CFNAI index of\neconomic activity that is imputed using seasonally unadjusted data. The imputed\nseries reveals episodes of increased variability of weekly economic information\nthat are masked by the monthly data, notably around the 2014-15 collapse in oil\nprices."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2009.01995": {"title": "Instrument Validity for Heterogeneous Causal Effects", "link": "http://arxiv.org/abs/2009.01995", "description": "This paper provides a general framework for testing instrument validity in\nheterogeneous causal effect models. The generalization includes the cases where\nthe treatment can be multivalued ordered or unordered. Based on a series of\ntestable implications, we propose a nonparametric test which is proved to be\nasymptotically size controlled and consistent. Compared to the tests in the\nliterature, our test can be applied in more general settings and may achieve\npower improvement. Refutation of instrument validity by the test helps detect\ninvalid instruments that may yield implausible results on causal effects.\nEvidence that the test performs well on finite samples is provided via\nsimulations. 
We revisit the empirical study on return to schooling to\ndemonstrate application of the proposed test in practice. An extended\ncontinuous mapping theorem and an extended delta method, which may be of\nindependent interest, are provided to establish the asymptotic distribution of\nthe test statistic under null."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2205.02274": {"title": "Reducing Marketplace Interference Bias Via Shadow Prices", "link": "http://arxiv.org/abs/2205.02274", "description": "Marketplace companies rely heavily on experimentation when making changes to\nthe design or operation of their platforms. The workhorse of experimentation is\nthe randomized controlled trial (RCT), or A/B test, in which users are randomly\nassigned to treatment or control groups. However, marketplace interference\ncauses the Stable Unit Treatment Value Assumption (SUTVA) to be violated,\nleading to bias in the standard RCT metric. In this work, we propose techniques\nfor platforms to run standard RCTs and still obtain meaningful estimates\ndespite the presence of marketplace interference. We specifically consider a\ngeneralized matching setting, in which the platform explicitly matches supply\nwith demand via a linear programming algorithm. Our first proposal is for the\nplatform to estimate the value of global treatment and global control via\noptimization. We prove that this approach is unbiased in the fluid limit. Our\nsecond proposal is to compare the average shadow price of the treatment and\ncontrol groups rather than the total value accrued by each group. We prove that\nthis technique corresponds to the correct first-order approximation (in a\nTaylor series sense) of the value function of interest even in a finite-size\nsystem. We then use this result to prove that, under reasonable assumptions,\nour estimator is less biased than the RCT estimator. At the heart of our result\nis the idea that it is relatively easy to model interference in matching-driven\nmarketplaces since, in such markets, the platform intermediates the spillover."}, "http://arxiv.org/abs/2208.09638": {"title": "Optimal Pre-Analysis Plans: Statistical Decisions Subject to Implementability", "link": "http://arxiv.org/abs/2208.09638", "description": "What is the purpose of pre-analysis plans, and how should they be designed?\nWe propose a principal-agent model where a decision-maker relies on selective\nbut truthful reports by an analyst. The analyst has data access, and\nnon-aligned objectives. 
In this model, the implementation of statistical\ndecision rules (tests, estimators) requires an incentive-compatible mechanism.\nWe first characterize which decision rules can be implemented. We then\ncharacterize optimal statistical decision rules subject to implementability. We\nshow that implementation requires pre-analysis plans. Focussing specifically on\nhypothesis tests, we show that optimal rejection rules pre-register a valid\ntest for the case when all data is reported, and make worst-case assumptions\nabout unreported data. Optimal tests can be found as a solution to a\nlinear-programming problem."}, "http://arxiv.org/abs/2302.11505": {"title": "Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes", "link": "http://arxiv.org/abs/2302.11505", "description": "This paper studies settings where the analyst is interested in identifying\nand estimating the average causal effect of a binary treatment on an outcome.\nWe consider a setup in which the outcome realization does not get immediately\nrealized after the treatment assignment, a feature that is ubiquitous in\nempirical settings. The period between the treatment and the realization of the\noutcome allows other observed actions to occur and affect the outcome. In this\ncontext, we study several regression-based estimands routinely used in\nempirical work to capture the average treatment effect and shed light on\ninterpreting them in terms of ceteris paribus effects, indirect causal effects,\nand selection terms. We obtain three main and related takeaways. First, the\nthree most popular estimands do not generally satisfy what we call \\emph{strong\nsign preservation}, in the sense that these estimands may be negative even when\nthe treatment positively affects the outcome conditional on any possible\ncombination of other actions. Second, the most popular regression that includes\nthe other actions as controls satisfies strong sign preservation \\emph{if and\nonly if} these actions are mutually exclusive binary variables. Finally, we\nshow that a linear regression that fully stratifies the other actions leads to\nestimands that satisfy strong sign preservation."}, "http://arxiv.org/abs/2302.13455": {"title": "Nickell Bias in Panel Local Projection: Financial Crises Are Worse Than You Think", "link": "http://arxiv.org/abs/2302.13455", "description": "Local Projection is widely used for impulse response estimation, with the\nFixed Effect (FE) estimator being the default for panel data. This paper\nhighlights the presence of Nickell bias for all regressors in the FE estimator,\neven if lagged dependent variables are absent in the regression. This bias is\nthe consequence of the inherent panel predictive specification. We recommend\nusing the split-panel jackknife estimator to eliminate the asymptotic bias and\nrestore the standard statistical inference. 
Revisiting three macro-finance\nstudies on the linkage between financial crises and economic contraction, we\nfind that the FE estimator substantially underestimates the post-crisis\neconomic losses."}, "http://arxiv.org/abs/2310.07151": {"title": "Identification and Estimation of a Semiparametric Logit Model using Network Data", "link": "http://arxiv.org/abs/2310.07151", "description": "This paper studies the identification and estimation of a semiparametric\nbinary network model in which the unobserved social characteristic is\nendogenous, that is, the unobserved individual characteristic influences both\nthe binary outcome of interest and how links are formed within the network. The\nexact functional form of the latent social characteristic is not known. The\nproposed estimators are obtained based on matching pairs of agents whose\nnetwork formation distributions are the same. The consistency and the\nasymptotic distribution of the estimators are proposed. The finite sample\nproperties of the proposed estimators in a Monte-Carlo simulation are assessed.\nWe conclude this study with an empirical application."}, "http://arxiv.org/abs/2310.07558": {"title": "Smootheness-Adaptive Dynamic Pricing with Nonparametric Demand Learning", "link": "http://arxiv.org/abs/2310.07558", "description": "We study the dynamic pricing problem where the demand function is\nnonparametric and H\\\"older smooth, and we focus on adaptivity to the unknown\nH\\\"older smoothness parameter $\\beta$ of the demand function. Traditionally the\noptimal dynamic pricing algorithm heavily relies on the knowledge of $\\beta$ to\nachieve a minimax optimal regret of\n$\\widetilde{O}(T^{\\frac{\\beta+1}{2\\beta+1}})$. However, we highlight the\nchallenge of adaptivity in this dynamic pricing problem by proving that no\npricing policy can adaptively achieve this minimax optimal regret without\nknowledge of $\\beta$. Motivated by the impossibility result, we propose a\nself-similarity condition to enable adaptivity. Importantly, we show that the\nself-similarity condition does not compromise the problem's inherent complexity\nsince it preserves the regret lower bound\n$\\Omega(T^{\\frac{\\beta+1}{2\\beta+1}})$. Furthermore, we develop a\nsmoothness-adaptive dynamic pricing algorithm and theoretically prove that the\nalgorithm achieves this minimax optimal regret bound without the prior\nknowledge $\\beta$."}, "http://arxiv.org/abs/1910.07452": {"title": "Identifying Network Ties from Panel Data: Theory and an Application to Tax Competition", "link": "http://arxiv.org/abs/1910.07452", "description": "Social interactions determine many economic behaviors, but information on\nsocial ties does not exist in most publicly available and widely used datasets.\nWe present results on the identification of social networks from observational\npanel data that contains no information on social ties between agents. In the\ncontext of a canonical social interactions model, we provide sufficient\nconditions under which the social interactions matrix, endogenous and exogenous\nsocial effect parameters are all globally identified. While this result is\nrelevant across different estimation strategies, we then describe how\nhigh-dimensional estimation techniques can be used to estimate the interactions\nmodel based on the Adaptive Elastic Net GMM method. We employ the method to\nstudy tax competition across US states. 
We find the identified social\ninteractions matrix implies tax competition differs markedly from the common\nassumption of competition between geographically neighboring states, providing\nfurther insights for the long-standing debate on the relative roles of factor\nmobility and yardstick competition in driving tax setting behavior across\nstates. Most broadly, our identification and application show the analysis of\nsocial interactions can be extended to economic realms where no network data\nexists."}, "http://arxiv.org/abs/2308.00913": {"title": "The Bayesian Context Trees State Space Model for time series modelling and forecasting", "link": "http://arxiv.org/abs/2308.00913", "description": "A hierarchical Bayesian framework is introduced for developing rich mixture\nmodels for real-valued time series, partly motivated by important applications\nin financial time series analysis. At the top level, meaningful discrete states\nare identified as appropriately quantised values of some of the most recent\nsamples. These observable states are described as a discrete context-tree\nmodel. At the bottom level, a different, arbitrary model for real-valued time\nseries -- a base model -- is associated with each state. This defines a very\ngeneral framework that can be used in conjunction with any existing model class\nto build flexible and interpretable mixture models. We call this the Bayesian\nContext Trees State Space Model, or the BCT-X framework. Efficient algorithms\nare introduced that allow for effective, exact Bayesian inference and learning\nin this setting; in particular, the maximum a posteriori probability (MAP)\ncontext-tree model can be identified. These algorithms can be updated\nsequentially, facilitating efficient online forecasting. The utility of the\ngeneral framework is illustrated in two particular instances: When\nautoregressive (AR) models are used as base models, resulting in a nonlinear AR\nmixture model, and when conditional heteroscedastic (ARCH) models are used,\nresulting in a mixture model that offers a powerful and systematic way of\nmodelling the well-known volatility asymmetries in financial data. In\nforecasting, the BCT-X methods are found to outperform state-of-the-art\ntechniques on simulated and real-world data, both in terms of accuracy and\ncomputational requirements. In modelling, the BCT-X structure finds natural\nstructure present in the data. In particular, the BCT-ARCH model reveals a\nnovel, important feature of stock market index data, in the form of an enhanced\nleverage effect."}} \ No newline at end of file diff --git a/stat_me_draws.json b/stat_me_draws.json index c9e3274..4444fcd 100644 --- a/stat_me_draws.json +++ b/stat_me_draws.json @@ -1 +1 @@ -{"http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 
2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. 
In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. 
Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. 
Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. 
We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. 
As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method-in-test. Still,\nmuch of the causal inference literature tends to design over-restricted or misspecified\nstudies. In this paper, we elaborate on the problem of improper simulation\ndesign for causal methods and compile a list of desiderata for an effective\nsimulation framework. We then introduce partially randomized causal simulation\n(PARCS), a simulation framework that meets those desiderata. PARCS synthesizes\ndata based on graphical causal models and a wide range of adjustable\nparameters. There is a legible mapping from the usual causal assumptions to the\nparameters; thus, users can identify and specify the subset of related\nparameters and randomize the remaining ones to generate a range of complying\ndata-generating processes for their causal method. The result is a more\ncomprehensive and inclusive empirical investigation of causal claims. Using\nPARCS, we reproduce and extend the simulation studies of two well-known causal\ndiscovery and missing data analysis papers to emphasize the necessity of a\nproper simulation design. Our results show that those papers would have\nimproved and extended the findings, had they used PARCS for simulation. The\nframework is also implemented as a Python package. 
By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. 
In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. 
By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameter estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to the classical\nresult for fixed-dimensional time series that the localization error of the\nchange-point estimator is $O_{p}(1)$, exact recovery of the true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the estimated number\nand locations of the change-points under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios in which the pigeonhole design offers greater benefits than\nthe benchmark designs. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed networks,\nwhich are frequently encountered in application domains but largely neglected in\nthe literature. 
The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\nfills the void of lacking uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of confidence\ninterval, as well as the false discovery rate (FDR) control for multiple node\ncomparison. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. 
The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. 
Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}, "http://arxiv.org/abs/2310.04452": {"title": "Short text classification with machine learning in the social sciences: The case of climate change on Twitter", "link": "http://arxiv.org/abs/2310.04452", "description": "To analyse large numbers of texts, social science researchers are\nincreasingly confronting the challenge of text classification. When manual\nlabeling is not possible and researchers have to find automatized ways to\nclassify texts, computer science provides a useful toolbox of machine-learning\nmethods whose performance remains understudied in the social sciences. In this\narticle, we compare the performance of the most widely used text classifiers by\napplying them to a typical research scenario in social science research: a\nrelatively small labeled dataset with infrequent occurrence of categories of\ninterest, which is a part of a large unlabeled dataset. As an example case, we\nlook at Twitter communication regarding climate change, a topic of increasing\nscholarly interest in interdisciplinary social science research. Using a novel\ndataset including 5,750 tweets from various international organizations\nregarding the highly ambiguous concept of climate change, we evaluate the\nperformance of methods in automatically classifying tweets based on whether\nthey are about climate change or not. In this context, we highlight two main\nfindings. First, supervised machine-learning methods perform better than\nstate-of-the-art lexicons, in particular as class balance increases. Second,\ntraditional machine-learning methods, such as logistic regression and random\nforest, perform similarly to sophisticated deep-learning methods, whilst\nrequiring much less training time and computational resources. The results have\nimportant implications for the analysis of short texts in social science\nresearch."}, "http://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "http://arxiv.org/abs/2310.04563", "description": "During the COVID-19 pandemic, implementing in-person indoor instruction in a\nsafe manner was a high priority for universities nationwide. To support this\neffort at the University, we developed a mathematical model for estimating the\nrisk of SARS-CoV-2 transmission in university classrooms. This model was used\nto design a safe classroom environment at the University during the COVID-19\npandemic that supported the higher occupancy levels needed to match\npre-pandemic numbers of in-person courses, despite a limited number of large\nclassrooms. A retrospective analysis at the end of the semester confirmed the\nmodel's assessment that the proposed classroom configuration would be safe. 
Our\nframework is generalizable and was also used to support reopening decisions at\nStanford University. In addition, our methods are flexible; our modeling\nframework was repurposed to plan for large university events and gatherings. We\nfound that our approach and methods work in a wide range of indoor settings and\ncould be used to support reopening planning across various industries, from\nsecondary schools to movie theaters and restaurants."}, "http://arxiv.org/abs/2310.04578": {"title": "TNDDR: Efficient and doubly robust estimation of COVID-19 vaccine effectiveness under the test-negative design", "link": "http://arxiv.org/abs/2310.04578", "description": "While the test-negative design (TND), which is routinely used for monitoring\nseasonal flu vaccine effectiveness (VE), has recently become integral to\nCOVID-19 vaccine surveillance, it is susceptible to selection bias due to\noutcome-dependent sampling. Some studies have addressed the identifiability and\nestimation of causal parameters under the TND, but efficiency bounds for\nnonparametric estimators of the target parameter under the unconfoundedness\nassumption have not yet been investigated. We propose a one-step doubly robust\nand locally efficient estimator called TNDDR (TND doubly robust), which\nutilizes sample splitting and can incorporate machine learning techniques to\nestimate the nuisance functions. We derive the efficient influence function\n(EIF) for the marginal expectation of the outcome under a vaccination\nintervention, explore the von Mises expansion, and establish the conditions for\n$\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR.\nThe proposed TNDDR is supported by both theoretical and empirical\njustifications, and we apply it to estimate COVID-19 VE in an administrative\ndataset of community-dwelling older people (aged $\\geq 60$y) in the province of\nQu\\'ebec, Canada."}, "http://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "http://arxiv.org/abs/2310.04660", "description": "Many scientific questions in biomedical, environmental, and psychological\nresearch involve understanding the impact of multiple factors on outcomes.\nWhile randomized factorial experiments are ideal for this purpose,\nrandomization is infeasible in many empirical studies. Therefore, investigators\noften rely on observational data, where drawing reliable causal inferences for\nmultiple factors remains challenging. As the number of treatment combinations\ngrows exponentially with the number of factors, some treatment combinations can\nbe rare or even missing by chance in observed data, further complicating\nfactorial effects estimation. To address these challenges, we propose a novel\nweighting method tailored to observational studies with multiple factors. Our\napproach uses weighted observational data to emulate a randomized factorial\nexperiment, enabling simultaneous estimation of the effects of multiple factors\nand their interactions. Our investigations reveal a crucial nuance: achieving\nbalance among covariates, as in single-factor scenarios, is necessary but\ninsufficient for unbiasedly estimating factorial effects. Our findings suggest\nthat balancing the factors is also essential in multi-factor settings.\nMoreover, we extend our weighting method to handle missing treatment\ncombinations in observed data. 
Finally, we study the asymptotic behavior of the\nnew weighting estimators and propose a consistent variance estimator, providing\nreliable inferences on factorial effects in observational studies."}, "http://arxiv.org/abs/2310.04709": {"title": "Time-dependent mediators in survival analysis: Graphical representation of causal assumptions", "link": "http://arxiv.org/abs/2310.04709", "description": "We study time-dependent mediators in survival analysis using a treatment\nseparation approach due to Didelez [2019] and based on earlier work by Robins\nand Richardson [2011]. This approach avoids nested counterfactuals and\ncrossworld assumptions which are otherwise common in mediation analysis. The\ncausal model of treatment, mediators, covariates, confounders and outcome is\nrepresented by causal directed acyclic graphs (DAGs). However, the DAGs tend to\nbe very complex when we have measurements at a large number of time points. We\ntherefore suggest using so-called rolled graphs in which a node represents an\nentire coordinate process instead of a single random variable, leading us to\nfar simpler graphical representations. The rolled graphs are not necessarily\nacyclic; they can be analyzed by $\\delta$-separation which is the appropriate\ngraphical separation criterion in this class of graphs and analogous to\n$d$-separation. In particular, $\\delta$-separation is a graphical tool for\nevaluating if the conditions of the mediation analysis are met or if unmeasured\nconfounders influence the estimated effects. We also state a mediational\ng-formula. This is similar to the approach in Vansteelandt et al. [2019]\nalthough that paper has a different conceptual basis. Finally, we apply this\nframework to a statistical model based on a Cox model with an added treatment\neffect.\n\nKeywords: survival analysis; mediation; causal inference; graphical models; local\nindependence graphs"}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows testing for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. 
We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.04919": {"title": "The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models", "link": "http://arxiv.org/abs/2310.04919", "description": "In modern scientific research, the objective is often to identify which\nvariables are associated with an outcome among a large class of potential\npredictors. This goal can be achieved by selecting variables in a manner that\ncontrols the false discovery rate (FDR), the proportion of irrelevant\npredictors among the selections. Knockoff filtering is a cutting-edge approach\nto variable selection that provides FDR control. Existing knockoff statistics\nfrequently employ linear models to assess relationships between features and\nthe response, but the linearity assumption is often violated in real-world\napplications. This may result in poor power to detect truly prognostic\nvariables. We introduce a knockoff statistic based on the conditional\nprediction function (CPF), which can pair with state-of-the-art machine learning\npredictive models, such as deep neural networks. The CPF statistics can capture\nthe nonlinear relationships between predictors and outcomes while also\naccounting for correlation between features. We illustrate the capability of\nthe CPF statistics to provide superior power over common knockoff statistics\nwith continuous, categorical, and survival outcomes using repeated simulations.\nKnockoff filtering with the CPF statistics is demonstrated using (1) a\nresidential building dataset to select predictors for the actual sales prices\nand (2) the TCGA dataset to select genes that are correlated with disease\nstaging in lung cancer patients."}, "http://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "http://arxiv.org/abs/2310.04924", "description": "Markov chain Monte Carlo significance tests were first introduced by Besag\nand Clifford in [4]. These methods produce statistically valid p-values in\nproblems where sampling from the null hypotheses is intractable. We give an\noverview of the methods of Besag and Clifford and some recent developments. A\nrange of examples and applications are discussed."}, "http://arxiv.org/abs/2310.04934": {"title": "UBSea: A Unified Community Detection Framework", "link": "http://arxiv.org/abs/2310.04934", "description": "Detecting communities in networks and graphs is an important task across many\ndisciplines such as statistics, social science and engineering. There are\ngenerally three different kinds of mixing patterns for the case of two\ncommunities: assortative mixing, disassortative mixing and core-periphery\nstructure. Modularity optimization is a classical way of fitting network\nmodels with communities. However, it can only deal with assortative mixing and\ndisassortative mixing when the mixing pattern is known and fails to discover\nthe core-periphery structure. In this paper, we extend modularity in a\nstrategic way and propose a new framework based on Unified Bigroups Standardized\nEdge-count Analysis (UBSea). It can address all the formerly mentioned\ncommunity mixing structures. In addition, this new framework is able to\nautomatically choose the mixing type to fit the networks. Simulation studies\nshow that the new framework has superb performance in a wide range of settings\nunder the stochastic block model and the degree-corrected stochastic block\nmodel. 
We show that the new approach produces consistent estimates of the\ncommunities under a suitable signal-to-noise ratio condition, for the case of a\nblock model with two communities, for both undirected and directed networks.\nThe new method is illustrated through applications to several real-world\ndatasets."}, "http://arxiv.org/abs/2310.05049": {"title": "On Estimation of Optimal Dynamic Treatment Regimes with Multiple Treatments for Survival Data-With Application to Colorectal Cancer Study", "link": "http://arxiv.org/abs/2310.05049", "description": "Dynamic treatment regimes (DTR) are sequential decision rules corresponding\nto several stages of intervention. Each rule maps patients' covariates to\noptional treatments. The optimal dynamic treatment regime is the one that\nmaximizes the mean outcome of interest if followed by the overall population.\nMotivated by a clinical study on advanced colorectal cancer with traditional\nChinese medicine, we propose a censored C-learning (CC-learning) method to\nestimate the dynamic treatment regime with multiple treatments using survival\ndata. To address the challenges of multiple stages with right censoring, we\nmodify the backward recursion algorithm in order to adapt to the flexible\nnumber and timing of treatments. For handling the problem of multiple\ntreatments, we propose a framework from the classification perspective by\ntransforming the problem of optimization with multiple treatment comparisons\ninto an example-dependent cost-sensitive classification problem. With\nclassification and regression tree (CART) as the classifier, the CC-learning\nmethod can produce an estimated optimal DTR with good interpretability. We\ntheoretically prove the optimality of our method and numerically evaluate its\nfinite-sample performance through simulation. With the proposed method, we\nidentify the interpretable tree treatment regimes at each stage for the\nadvanced colorectal cancer treatment data from Xiyuan Hospital."}, "http://arxiv.org/abs/2310.05151": {"title": "Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions", "link": "http://arxiv.org/abs/2310.05151", "description": "In clinical trials of longitudinal continuous outcomes, reference-based\nimputation (RBI) has commonly been applied to handle missing outcome data in\nsettings where the estimand incorporates the effects of intercurrent events,\ne.g. treatment discontinuation. RBI was originally developed in the multiple\nimputation framework; however, recently conditional mean imputation (CMI)\ncombined with the jackknife estimator of the standard error was proposed as a\nway to obtain deterministic treatment effect estimates and correct frequentist\ninference. For both multiple imputation and CMI, a mixed model for repeated measures\n(MMRM) is often used for the imputation model, but this can be computationally\nintensive to fit to multiple data sets (e.g. the jackknife samples) and lead to\nconvergence issues with complex MMRM models with many parameters. Therefore, a\nstep-wise approach based on sequential linear regression (SLR) of the outcomes\nat each visit was developed for the imputation model in the multiple imputation\nframework, but similar developments in the CMI framework are lacking. In this\narticle, we fill this gap in the literature by proposing an SLR approach to\nimplement RBI in the CMI framework, and justify its validity using theoretical\nresults and simulations. 
We also illustrate our proposal on a real data\napplication."}, "http://arxiv.org/abs/2310.05398": {"title": "Statistical Inference for Modulation Index in Phase-Amplitude Coupling", "link": "http://arxiv.org/abs/2310.05398", "description": "Phase-amplitude coupling is a phenomenon observed in several neurological\nprocesses, where the phase of one signal modulates the amplitude of another\nsignal with a distinct frequency. The modulation index (MI) is a common\ntechnique used to quantify this interaction by assessing the Kullback-Leibler\ndivergence between a uniform distribution and the empirical conditional\ndistribution of amplitudes with respect to the phases of the observed signals.\nThe uniform distribution is an ideal representation that is expected to appear\nunder the absence of coupling. However, it does not reflect the statistical\nproperties of coupling values caused by random chance. In this paper, we\npropose a statistical framework for evaluating the significance of an observed\nMI value based on a null hypothesis that a MI value can be entirely explained\nby chance. Significance is obtained by comparing the value with a reference\ndistribution derived under the null hypothesis of independence (i.e., no\ncoupling) between signals. We derived a closed-form distribution of this null\nmodel, resulting in a scaled beta distribution. To validate the efficacy of our\nproposed framework, we conducted comprehensive Monte Carlo simulations,\nassessing the significance of MI values under various experimental scenarios,\nincluding amplitude modulation, trains of spikes, and sequences of\nhigh-frequency oscillations. Furthermore, we corroborated the reliability of\nour model by comparing its statistical significance thresholds with reported\nvalues from other research studies conducted under different experimental\nsettings. Our method offers several advantages such as meta-analysis\nreliability, simplicity and computational efficiency, as it provides p-values\nand significance levels without resorting to generating surrogate data through\nsampling procedures."}, "http://arxiv.org/abs/2310.05526": {"title": "Projecting infinite time series graphs to finite marginal graphs using number theory", "link": "http://arxiv.org/abs/2310.05526", "description": "In recent years, a growing number of method and application works have\nadapted and applied the causal-graphical-model framework to time series data.\nMany of these works employ time-resolved causal graphs that extend infinitely\ninto the past and future and whose edges are repetitive in time, thereby\nreflecting the assumption of stationary causal relationships. However, most\nresults and algorithms from the causal-graphical-model framework are not\ndesigned for infinite graphs. In this work, we develop a method for projecting\ninfinite time series graphs with repetitive edges to marginal graphical models\non a finite time window. These finite marginal graphs provide the answers to\n$m$-separation queries with respect to the infinite graph, a task that was\npreviously unresolved. Moreover, we argue that these marginal graphs are useful\nfor causal discovery and causal effect estimation in time series, effectively\nenabling to apply results developed for finite graphs to the infinite graphs.\nThe projection procedure relies on finding common ancestors in the\nto-be-projected graph and is, by itself, not new. 
However, the projection\nprocedure has not yet been algorithmically implemented for time series graphs\nsince in these infinite graphs there can be infinite sets of paths that might\ngive rise to common ancestors. We solve the search over these possibly infinite\nsets of paths by an intriguing combination of path-finding techniques for\nfinite directed graphs and solution theory for linear Diophantine equations. By\nproviding an algorithm that carries out the projection, our paper makes an\nimportant step towards a theoretically-grounded and method-agnostic\ngeneralization of a range of causal inference methods and results to time\nseries."}, "http://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "http://arxiv.org/abs/2310.05539", "description": "In response to the unique challenge created by high-dimensional mediators in\nmediation analysis, this paper presents a novel procedure for testing the\nnullity of the mediation effect in the presence of high-dimensional mediators.\nThe procedure incorporates two distinct features. Firstly, the test remains\nvalid under all cases of the composite null hypothesis, including the\nchallenging scenario where both exposure-mediator and mediator-outcome\ncoefficients are zero. Secondly, it does not impose structural assumptions on\nthe exposure-mediator coefficients, thereby allowing for an arbitrarily strong\nexposure-mediator relationship. To the best of our knowledge, the proposed test\nis the first of its kind to provably possess these two features in\nhigh-dimensional mediation analysis. The validity and consistency of the\nproposed test are established, and its numerical performance is showcased\nthrough simulation studies. The application of the proposed test is\ndemonstrated by examining the mediation effect of DNA methylation between\nsmoking status and lung cancer development."}, "http://arxiv.org/abs/2310.05548": {"title": "Cokrig-and-Regress for Spatially Misaligned Environmental Data", "link": "http://arxiv.org/abs/2310.05548", "description": "Spatially misaligned data, where the response and covariates are observed at\ndifferent spatial locations, commonly arise in many environmental studies. Much\nof the statistical literature on handling spatially misaligned data has been\ndevoted to the case of a single covariate and a linear relationship between the\nresponse and this covariate. Motivated by spatially misaligned data collected\non air pollution and weather in China, we propose a cokrig-and-regress (CNR)\nmethod to estimate spatial regression models involving multiple covariates and\npotentially non-linear associations. The CNR estimator is constructed by\nreplacing the unobserved covariates (at the response locations) by their\ncokriging predictor derived from the observed but misaligned covariates under a\nmultivariate Gaussian assumption, where a generalized Kronecker product\ncovariance is used to account for spatial correlations within and between\ncovariates. A parametric bootstrap approach is employed to bias-correct the CNR\nestimates of the spatial covariance parameters and for uncertainty\nquantification. Simulation studies demonstrate that CNR outperforms several\nexisting methods for handling spatially misaligned data, such as\nnearest-neighbor interpolation. 
Applying CNR to the spatially misaligned air\npollution and weather data in China reveals a number of non-linear\nrelationships between PM$_{2.5}$ concentration and several meteorological\ncovariates."}, "http://arxiv.org/abs/2310.05622": {"title": "A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards", "link": "http://arxiv.org/abs/2310.05622", "description": "While well-established methods for time-to-event data are available when the\nproportional hazards assumption holds, there is no consensus on the best\ninferential approach under non-proportional hazards (NPH). However, a wide\nrange of parametric and non-parametric methods for testing and estimation in\nthis scenario have been proposed. To provide recommendations on the statistical\nanalysis of clinical trials where non-proportional hazards are expected, we\nconducted a comprehensive simulation study under different scenarios of\nnon-proportional hazards, including delayed onset of treatment effect, crossing\nhazard curves, subgroups with different treatment effects, and changing hazards\nafter disease progression. We assessed type I error rate control, power and\nconfidence interval coverage, where applicable, for a wide range of methods\nincluding weighted log-rank tests, the MaxCombo test, summary measures such as\nthe restricted mean survival time (RMST), average hazard ratios, and milestone\nsurvival probabilities as well as accelerated failure time regression models.\nWe found a trade-off between interpretability and power when choosing an\nanalysis strategy under NPH scenarios. While analysis methods based on weighted\nlog-rank tests were typically favorable in terms of power, they do not provide\nan easily interpretable treatment effect estimate. Also, depending on the\nweight function, they test a narrow null hypothesis of equal hazard functions\nand rejection of this null hypothesis may not allow for a direct conclusion of\ntreatment benefit in terms of the survival function. In contrast,\nnon-parametric procedures based on well-interpretable measures such as the RMST\ndifference had lower power in most scenarios. Model-based methods relying on\nspecific survival distributions had larger power but often gave biased\nestimates and lower-than-nominal confidence interval coverage."}, "http://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "http://arxiv.org/abs/2310.05646", "description": "We study transfer learning in the context of estimating piecewise-constant\nsignals when source data, which may be relevant but disparate, are available in\naddition to the target data. We initially investigate transfer learning\nestimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for\nunisource data scenarios and then generalise these estimators to accommodate\nmultisource data. To further reduce estimation errors, especially in scenarios\nwhere some sources significantly differ from the target, we introduce an\ninformative source selection algorithm. We then examine these estimators with\nmultisource selection and establish their minimax optimality under specific\nregularity conditions. It is worth emphasising that, unlike the prevalent\nnarrative in the transfer learning literature that the performance is enhanced\nthrough large source sample sizes, our approaches leverage higher observation\nfrequencies and accommodate diverse frequencies across multiple sources. 
Our\ntheoretical findings are empirically validated through extensive numerical\nexperiments, with the code available online at\nhttps://github.com/chrisfanwang/transferlearning"}, "http://arxiv.org/abs/2310.05685": {"title": "Post-Selection Inference for Sparse Estimation", "link": "http://arxiv.org/abs/2310.05685", "description": "When the model is not known and parameter testing or interval estimation is\nconducted after model selection, it is necessary to consider selective\ninference. This paper discusses this issue in the context of sparse estimation.\nFirstly, we describe selective inference related to Lasso as per \\cite{lee},\nand then present polyhedra and truncated distributions when applying it to\nmethods such as Forward Stepwise and LARS. Lastly, we discuss the Significance\nTest for Lasso by \\cite{significant} and the Spacing Test for LARS by\n\\cite{ryan_exact}. This paper serves as a review article.\n\nKeywords: post-selective inference, polyhedron, LARS, lasso, forward\nstepwise, significance test, spacing test."}, "http://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "http://arxiv.org/abs/2310.05921", "description": "We introduce Conformal Decision Theory, a framework for producing safe\nautonomous decisions despite imperfect machine learning predictions. Examples\nof such decisions are ubiquitous, from robot planning algorithms that rely on\npedestrian predictions, to calibrating autonomous manufacturing to exhibit high\nthroughput and low error, to the choice of trusting a nominal policy versus\nswitching to a safe backup policy at run-time. The decisions produced by our\nalgorithms are safe in the sense that they come with provable statistical\nguarantees of having low risk without any assumptions on the world model\nwhatsoever; the observations need not be I.I.D. and can even be adversarial.\nThe theory extends results from conformal prediction to calibrate decisions\ndirectly, without requiring the construction of prediction sets. Experiments\ndemonstrate the utility of our approach in robot motion planning around humans,\nautomated stock trading, and robot manufacturing."}, "http://arxiv.org/abs/2101.06950": {"title": "Learning and scoring Gaussian latent variable causal models with unknown additive interventions", "link": "http://arxiv.org/abs/2101.06950", "description": "With observational data alone, causal structure learning is a challenging\nproblem. The task becomes easier when having access to data collected from\nperturbations of the underlying system, even when the nature of these is\nunknown. Existing methods either do not allow for the presence of latent\nvariables or assume that these remain unperturbed. However, these assumptions\nare hard to justify if the nature of the perturbations is unknown. We provide\nresults that enable scoring causal structures in the setting with additive, but\nunknown interventions. Specifically, we propose a maximum-likelihood estimator\nin a structural equation model that exploits system-wide invariances to output\nan equivalence class of causal structures from perturbation data. Furthermore,\nunder certain structural assumptions on the population model, we provide a\nsimple graphical characterization of all the DAGs in the interventional\nequivalence class. 
We illustrate the utility of our framework on synthetic data\nas well as real data involving California reservoirs and protein expressions.\nThe software implementation is available as the Python package \\emph{utlvce}."}, "http://arxiv.org/abs/2107.14151": {"title": "Modern Non-Linear Function-on-Function Regression", "link": "http://arxiv.org/abs/2107.14151", "description": "We introduce a new class of non-linear function-on-function regression models\nfor functional data using neural networks. We propose a framework using a\nhidden layer consisting of continuous neurons, called a continuous hidden\nlayer, for functional response modeling and give two model fitting strategies,\nFunctional Direct Neural Network (FDNN) and Functional Basis Neural Network\n(FBNN). Both are designed explicitly to exploit the structure inherent in\nfunctional data and capture the complex relations existing between the\nfunctional predictors and the functional response. We fit these models by\nderiving functional gradients and implement regularization techniques for more\nparsimonious results. We demonstrate the power and flexibility of our proposed\nmethod in handling complex functional models through extensive simulation\nstudies as well as real data examples."}, "http://arxiv.org/abs/2112.00832": {"title": "On the mixed-model analysis of covariance in cluster-randomized trials", "link": "http://arxiv.org/abs/2112.00832", "description": "In the analyses of cluster-randomized trials, mixed-model analysis of\ncovariance (ANCOVA) is a standard approach for covariate adjustment and\nhandling within-cluster correlations. However, when the normality, linearity,\nor the random-intercept assumption is violated, the validity and efficiency of\nthe mixed-model ANCOVA estimators for estimating the average treatment effect\nremain unclear. Under the potential outcomes framework, we prove that the\nmixed-model ANCOVA estimators for the average treatment effect are consistent\nand asymptotically normal under arbitrary misspecification of its working\nmodel. If the probability of receiving treatment is 0.5 for each cluster, we\nfurther show that the model-based variance estimator under mixed-model ANCOVA1\n(ANCOVA without treatment-covariate interactions) remains consistent,\nclarifying that the confidence interval given by standard software is\nasymptotically valid even under model misspecification. Beyond robustness, we\ndiscuss several insights on precision among classical methods for analyzing\ncluster-randomized trials, including the mixed-model ANCOVA, individual-level\nANCOVA, and cluster-level ANCOVA estimators. These insights may inform the\nchoice of methods in practice. Our analytical results and insights are\nillustrated via simulation studies and analyses of three cluster-randomized\ntrials."}, "http://arxiv.org/abs/2201.10770": {"title": "Confidence intervals for the Cox model test error from cross-validation", "link": "http://arxiv.org/abs/2201.10770", "description": "Cross-validation (CV) is one of the most widely used techniques in\nstatistical learning for estimating the test error of a model, but its behavior\nis not yet fully understood. It has been shown that standard confidence\nintervals for test error using estimates from CV may have coverage below\nnominal levels. This phenomenon occurs because each sample is used in both the\ntraining and testing procedures during CV and as a result, the CV estimates of\nthe errors become correlated. 
Without accounting for this correlation, the\nestimate of the variance is smaller than it should be. One way to mitigate this\nissue is by estimating the mean squared error of the prediction error instead\nusing nested CV. This approach has been shown to achieve superior coverage\ncompared to intervals derived from standard CV. In this work, we generalize the\nnested CV idea to the Cox proportional hazards model and explore various\nchoices of test error for this setting."}, "http://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2202.08419", "description": "In this paper, we develop a novel high-dimensional time-varying coefficient\nestimation method, based on high-dimensional Ito diffusion processes. To\naccount for high-dimensional time-varying coefficients, we first estimate local\n(or instantaneous) coefficients using a time-localized Dantzig selection scheme\nunder a sparsity condition, which results in biased local coefficient\nestimators due to the regularization. To handle the bias, we propose a\ndebiasing scheme, which provides well-performing unbiased local coefficient\nestimators. With the unbiased local coefficient estimators, we estimate the\nintegrated coefficient, and to further account for the sparsity of the\ncoefficient process, we apply thresholding schemes. We call this Thresholding\ndEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED\nestimator. In the empirical analysis, we apply the TED procedure to analyzing\nhigh-dimensional factor models using high-frequency data."}, "http://arxiv.org/abs/2206.12525": {"title": "Causality of Functional Longitudinal Data", "link": "http://arxiv.org/abs/2206.12525", "description": "\"Treatment-confounder feedback\" is the central complication to resolve in\nlongitudinal studies, to infer causality. The existing frameworks for\nidentifying causal effects for longitudinal studies with discrete repeated\nmeasures hinge heavily on assuming that time advances in discrete time steps or\ntreatment changes as a jumping process, rendering the number of \"feedbacks\"\nfinite. However, medical studies nowadays with real-time monitoring involve\nfunctional time-varying outcomes, treatment, and confounders, which leads to an\nuncountably infinite number of feedbacks between treatment and confounders.\nTherefore more general and advanced theory is needed. We generalize the\ndefinition of causal effects under user-specified stochastic treatment regimes\nto longitudinal studies with continuous monitoring and develop an\nidentification framework, allowing right censoring and truncation by death. We\nprovide sufficient identification assumptions including a generalized\nconsistency assumption, a sequential randomization assumption, a positivity\nassumption, and a novel \"intervenable\" assumption designed for the\ncontinuous-time case. Under these assumptions, we propose a g-computation\nprocess and an inverse probability weighting process, which suggest a\ng-computation formula and an inverse probability weighting formula for\nidentification. For practical purposes, we also construct two classes of\npopulation estimating equations to identify these two processes, respectively,\nwhich further suggest a doubly robust identification formula with extra\nrobustness against process misspecification. 
We prove that our framework fully\ngeneralizes the existing frameworks and is nonparametric."}, "http://arxiv.org/abs/2209.08139": {"title": "Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2209.08139", "description": "Bayesian variable selection methods are powerful techniques for fitting and\ninferring on sparse high-dimensional linear regression models. However, many\nare computationally intensive or require restrictive prior distributions on\nmodel parameters. In this paper, we propose a computationally efficient and\npowerful Bayesian approach for sparse high-dimensional linear regression.\nMinimal prior assumptions on the parameters are required through the use of\nplug-in empirical Bayes estimates of hyperparameters. Efficient maximum a\nposteriori (MAP) estimation is completed through a Parameter-Expanded\nExpectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in\na robust, computationally efficient coordinate-wise optimization which -- when\nupdating the coefficient for a particular predictor -- adjusts for the impact\nof other predictor variables. The completion of the E-step uses an approach\nmotivated by the popular two-group approach to multiple testing. The result is\na PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse\nhigh-dimensional linear regression, which can be completed using one-at-a-time\nor all-at-once type optimization. We compare the empirical properties of PROBE\nto comparable approaches with numerous simulation studies and analyses of\ncancer cell drug responses. The proposed approach is implemented in the R\npackage probe."}, "http://arxiv.org/abs/2212.02709": {"title": "SURE-tuned Bridge Regression", "link": "http://arxiv.org/abs/2212.02709", "description": "Consider the {$\\ell_{\\alpha}$} regularized linear regression, also termed\nBridge regression. For $\\alpha\\in (0,1)$, Bridge regression enjoys several\nstatistical properties of interest such as sparsity and near-unbiasedness of\nthe estimates (Fan and Li, 2001). However, the main difficulty lies in the\nnon-convex nature of the penalty for these values of $\\alpha$, which makes an\noptimization procedure challenging, and usually it is only possible to find a\nlocal optimum. To address this issue, Polson et al. (2013) took a sampling-based\nfully Bayesian approach to this problem, using the correspondence between\nthe Bridge penalty and a power exponential prior on the regression\ncoefficients. However, their sampling procedure relies on Markov chain Monte\nCarlo (MCMC) techniques, which are inherently sequential and not scalable to\nlarge problem dimensions. Cross validation approaches are similarly\ncomputation-intensive. To this end, our contribution is a novel\n\\emph{non-iterative} method to fit a Bridge regression model. The main\ncontribution lies in an explicit formula for Stein's unbiased risk estimate for\nthe out-of-sample prediction risk of Bridge regression, which can then be\noptimized to select the desired tuning parameters, allowing us to completely\nbypass MCMC as well as computation-intensive cross validation approaches. Our\nprocedure yields results in a fraction of the computational time compared to\niterative schemes, without any appreciable loss in statistical performance. 
An\nR implementation is publicly available online at:\nhttps://github.com/loriaJ/Sure-tuned_BridgeRegression ."}, "http://arxiv.org/abs/2212.03122": {"title": "Robust convex biclustering with a tuning-free method", "link": "http://arxiv.org/abs/2212.03122", "description": "Biclustering is widely used in many fields, including gene\ninformation analysis, text mining, and recommendation systems, by effectively\ndiscovering the local correlation between samples and features. However, many\nbiclustering algorithms will collapse when facing heavy-tailed data. In this\npaper, we propose a robust version of the convex biclustering algorithm with Huber\nloss. Yet, the newly introduced robustification parameter brings an extra\nburden to the selection of the optimal parameters. Therefore, we propose a tuning-free\nmethod for automatically selecting the optimal robustification parameter with\nhigh efficiency. The simulation study demonstrates the superior\nperformance of our proposed method over traditional biclustering methods when\nencountering heavy-tailed noise. A real-life biomedical application is also\npresented. The R package RcvxBiclustr is available at\nhttps://github.com/YifanChen3/RcvxBiclustr."}, "http://arxiv.org/abs/2301.09661": {"title": "Estimating marginal treatment effects from observational studies and indirect treatment comparisons: When are standardization-based methods preferable to those based on propensity score weighting?", "link": "http://arxiv.org/abs/2301.09661", "description": "In light of newly developed standardization methods, we evaluate, via\nsimulation study, how propensity score weighting and standardization-based\napproaches compare for obtaining estimates of the marginal odds ratio and the\nmarginal hazard ratio. Specifically, we consider how the two approaches compare\nin two different scenarios: (1) in a single observational study, and (2) in an\nanchored indirect treatment comparison (ITC) of randomized controlled trials.\nWe present the material in such a way that the matching-adjusted indirect\ncomparison (MAIC) and the (novel) simulated treatment comparison (STC) methods\nin the ITC setting may be viewed as analogous to the propensity score weighting\nand standardization methods in the single observational study setting. Our\nresults suggest that current recommendations for conducting ITCs can be\nimproved and underscore the importance of adjusting for purely prognostic\nfactors."}, "http://arxiv.org/abs/2302.11746": {"title": "Logistic Regression and Classification with non-Euclidean Covariates", "link": "http://arxiv.org/abs/2302.11746", "description": "We introduce a logistic regression model for data pairs consisting of a\nbinary response and a covariate residing in a non-Euclidean metric space\nwithout vector structures. Based on the proposed model we also develop a binary\nclassifier for non-Euclidean objects. We propose a maximum likelihood estimator\nfor the non-Euclidean regression coefficient in the model, and provide upper\nbounds on the estimation error under various metric entropy conditions that\nquantify the complexity of the underlying metric space. Matching lower bounds are\nderived for the important metric spaces commonly seen in statistics,\nestablishing optimality of the proposed estimator in such spaces. Similarly, an\nupper bound on the excess risk of the developed classifier is provided for\ngeneral metric spaces. 
A finer upper bound and a matching lower bound, and thus\noptimality of the proposed classifier, are established for Riemannian\nmanifolds. We investigate the numerical performance of the proposed estimator\nand classifier via simulation studies, and illustrate their practical merits\nvia an application to task-related fMRI data."}, "http://arxiv.org/abs/2302.13658": {"title": "Robust High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2302.13658", "description": "In this paper, we develop a novel high-dimensional coefficient estimation\nprocedure based on high-frequency data. Unlike usual high-dimensional\nregression procedures such as LASSO, we additionally handle the heavy-tailedness\nof high-frequency observations as well as time variations of coefficient\nprocesses. Specifically, we employ the Huber loss and a truncation scheme to handle\nheavy-tailed observations, while $\\ell_{1}$-regularization is adopted to\novercome the curse of dimensionality. To account for the time-varying\ncoefficient, we estimate local coefficients, which are biased due to the\n$\\ell_{1}$-regularization. Thus, when estimating integrated coefficients, we\npropose a debiasing scheme to enjoy the law of large numbers property and employ\na thresholding scheme to further accommodate the sparsity of the coefficients.\nWe call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show\nthat the RED-LASSO estimator can achieve a near-optimal convergence rate. In\nthe empirical study, we apply the RED-LASSO procedure to the high-dimensional\nintegrated coefficient estimation using high-frequency trading data."}, "http://arxiv.org/abs/2307.04754": {"title": "Action-State Dependent Dynamic Model Selection", "link": "http://arxiv.org/abs/2307.04754", "description": "A model among many may only be best under certain states of the world.\nSwitching from one model to another can also be costly. Finding a procedure to\ndynamically choose a model in these circumstances requires solving a complex\nestimation procedure and a dynamic programming problem. A reinforcement\nlearning algorithm is used to approximate and estimate from the data the\noptimal solution to this dynamic programming problem. The algorithm is shown to\nconsistently estimate the optimal policy that may choose different models based\non a set of covariates. A typical example is that of switching between\ndifferent portfolio models under rebalancing costs, using macroeconomic\ninformation. Using a set of macroeconomic variables and price data, an\nempirical application to the aforementioned portfolio problem shows superior\nperformance relative to choosing the best portfolio model with hindsight."}, "http://arxiv.org/abs/2307.14828": {"title": "Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley", "link": "http://arxiv.org/abs/2307.14828", "description": "Two-component mixture models have proved to be a powerful tool for modeling\nheterogeneity in several cluster analysis contexts. However, most methods based\non these models assume a constant behavior for the mixture weights, which can\nbe restrictive and unsuitable for some applications. In this paper, we relax\nthis assumption and allow the mixture weights to vary according to the index\n(e.g., time) to make the model more adaptive to a broader range of data sets.\nWe propose an efficient MCMC algorithm to jointly estimate both component\nparameters and dynamic weights from their posterior samples. 
We evaluate the\nmethod's performance by running Monte Carlo simulation studies under different\nscenarios for the dynamic weights. In addition, we apply the algorithm to a\ntime series that records the level reached by a river in southern Brazil. The\nTaquari River is a water body whose frequent flood inundations have caused\nvarious damage to riverside communities. Implementing a dynamic mixture model\nallows us to properly describe the flood regimes for the areas most affected by\nthese phenomena."}, "http://arxiv.org/abs/2310.06130": {"title": "Statistical inference for radially-stable generalized Pareto distributions and return level-sets in geometric extremes", "link": "http://arxiv.org/abs/2310.06130", "description": "We obtain a functional analogue of the quantile function for probability\nmeasures admitting a continuous Lebesgue density on $\\mathbb{R}^d$, and use it\nto characterize the class of non-trivial limit distributions of radially\nrecentered and rescaled multivariate exceedances in geometric extremes. A new\nclass of multivariate distributions is identified, termed radially stable\ngeneralized Pareto distributions, and is shown to admit certain stability\nproperties that permit extrapolation to extremal sets along any direction in\n$\\mathbb{R}^d$. Based on the limit Poisson point process likelihood of the\nradially renormalized point process of exceedances, we develop parsimonious\nstatistical models that exploit theoretical links between structural\nstar-bodies and are amenable to Bayesian inference. The star-bodies determine\nthe mean measure of the limit Poisson process through a hierarchical structure.\nOur framework sharpens statistical inference by suitably including additional\ninformation from the angular directions of the geometric exceedances and\nfacilitates efficient computations in dimensions $d=2$ and $d=3$. Additionally,\nit naturally leads to the notion of the return level-set, which is a canonical\nquantile set expressed in terms of its average recurrence interval, and a\ngeometric analogue of the uni-dimensional return level. We illustrate our\nmethods with a simulation study showing superior predictive performance of\nprobabilities of rare events, and with two case studies, one associated with\nriver flow extremes, and the other with oceanographic extremes."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. 
This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2310.06252": {"title": "Power and sample size calculation of two-sample projection-based testing for sparsely observed functional data", "link": "http://arxiv.org/abs/2310.06252", "description": "Projection-based testing for mean trajectory differences in two groups of\nirregularly and sparsely observed functional data has garnered significant\nattention in the literature because it accommodates a wide spectrum of group\ndifferences and (non-stationary) covariance structures. This article presents\nthe derivation of the theoretical power function and the introduction of a\ncomprehensive power and sample size (PASS) calculation toolkit tailored to the\nprojection-based testing method developed by Wang (2021). Our approach\naccommodates a wide spectrum of group difference scenarios and a broad class of\ncovariance structures governing the underlying processes. Through extensive\nnumerical simulation, we demonstrate the robustness of this testing method by\nshowcasing that its statistical power remains nearly unaffected even when a\ncertain percentage of observations are missing, rendering it 'missing-immune'.\nFurthermore, we illustrate the practical utility of this test through analysis\nof two randomized controlled trials of Parkinson's disease. To facilitate\nimplementation, we provide a user-friendly R package fPASS, complete with a\ndetailed vignette to guide users through its practical application. We\nanticipate that this article will significantly enhance the usability of this\npotent statistical tool across a range of biostatistical applications, with a\nparticular focus on its relevance in the design of clinical trials."}, "http://arxiv.org/abs/2310.06315": {"title": "Ultra-high dimensional confounder selection algorithms comparison with application to radiomics data", "link": "http://arxiv.org/abs/2310.06315", "description": "Radiomics is an emerging area of medical imaging data analysis particularly\nfor cancer. It involves the conversion of digital medical images into mineable\nultra-high dimensional data. Machine learning algorithms are widely used in\nradiomics data analysis to develop powerful decision support model to improve\nprecision in diagnosis, assessment of prognosis and prediction of therapy\nresponse. However, machine learning algorithms for causal inference have not\nbeen previously employed in radiomics analysis. In this paper, we evaluate the\nvalue of machine learning algorithms for causal inference in radiomics. We\nselect three recent competitive variable selection algorithms for causal\ninference: outcome-adaptive lasso (OAL), generalized outcome-adaptive lasso\n(GOAL) and causal ball screening (CBS). We used a sure independence screening\nprocedure to propose an extension of GOAL and OAL for ultra-high dimensional\ndata, SIS + GOAL and SIS + OAL. We compared SIS + GOAL, SIS + OAL and CBS using\nsimulation study and two radiomics datasets in cancer, osteosarcoma and\ngliosarcoma. 
The two radiomics studies and the simulation study identified SIS\n+ GOAL as the optimal variable selection algorithm."}, "http://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "http://arxiv.org/abs/2310.06330", "description": "Markov chain Monte Carlo (MCMC) is a commonly used method for approximating\nexpectations with respect to probability distributions. Uncertainty assessment\nfor MCMC estimators is essential in practical applications. Moreover, for\nmultivariate functions of a Markov chain, it is important to estimate not only\nthe auto-correlation for each component but also to estimate\ncross-correlations, in order to better assess sample quality, improve estimates\nof effective sample size, and use more effective stopping rules. Berg and Song\n[2022] introduced the moment least squares (momentLS) estimator, a\nshape-constrained estimator for the autocovariance sequence from a reversible\nMarkov chain, for univariate functions of the Markov chain. Based on this\nsequence estimator, they proposed an estimator of the asymptotic variance of\nthe sample mean from MCMC samples. In this study, we propose novel\nautocovariance sequence and asymptotic variance estimators for Markov chain\nfunctions with multiple components, based on the univariate momentLS estimators\nfrom Berg and Song [2022]. We demonstrate strong consistency of the proposed\nauto(cross)-covariance sequence and asymptotic variance matrix estimators. We\nconduct empirical comparisons of our method with other state-of-the-art\napproaches on simulated and real-data examples, using popular samplers\nincluding the random-walk Metropolis sampler and the No-U-Turn sampler from\nSTAN."}, "http://arxiv.org/abs/2310.06357": {"title": "Adaptive Storey's null proportion estimator", "link": "http://arxiv.org/abs/2310.06357", "description": "False discovery rate (FDR) is a commonly used criterion in multiple testing\nand the Benjamini-Hochberg (BH) procedure is arguably the most popular approach\nwith FDR guarantee. To improve power, the adaptive BH procedure has been\nproposed by incorporating various null proportion estimators, among which\nStorey's estimator has gained substantial popularity. The performance of\nStorey's estimator hinges on a critical hyper-parameter, where a pre-fixed\nconfiguration lacks power and existing data-driven hyper-parameters compromise\nthe FDR control. In this work, we propose a novel class of adaptive\nhyper-parameters and establish the FDR control of the associated BH procedure\nusing a martingale argument. Within this class of data-driven hyper-parameters,\nwe present a specific configuration designed to maximize the number of\nrejections and characterize the convergence of this proposal to the optimal\nhyper-parameter under a commonly-used mixture model. We evaluate our adaptive\nStorey's null proportion estimator and the associated BH procedure on extensive\nsimulated data and a motivating protein dataset. Our proposal exhibits\nsignificant power gains when dealing with a considerable proportion of weak\nnon-nulls or a conservative null distribution."}, "http://arxiv.org/abs/2310.06467": {"title": "Advances in Kth nearest-neighbour clutter removal", "link": "http://arxiv.org/abs/2310.06467", "description": "We consider the problem of feature detection in the presence of clutter in\nspatial point processes. Classification methods have been developed in previous\nstudies. 
Among these, Byers and Raftery (1998) models the observed Kth nearest\nneighbour distances as a mixture distribution and classifies the clutter and\nfeature points accordingly. In this paper, we enhance this approach in two\nways. First, we propose an automatic procedure for selecting the number of\nnearest neighbours to consider in the classification method by means of\nsegmented regression models. Second, with the aim of applying the procedure\nmultiple times to get a ``better\" end result, we propose a stopping criterion\nthat minimizes the overall entropy measure of cluster separation between\nclutter and feature points. The proposed procedures are suitable for a feature\nwith clutter modelled as two superimposed Poisson processes on any space, including\nlinear networks. We present simulations and two case studies of environmental\ndata to illustrate the method."}, "http://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "http://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated with a class\nof partially observed stochastic differential equations (SDE) driven by jump\nprocesses. Such models are routinely found in applications; here we focus on the\ncase of neuroscience. The data are assumed to be observed\nregularly in time and driven by the SDE model with unknown parameters. In\npractice the SDE may not have an analytically tractable solution and this leads\nnaturally to a time-discretization. We adapt the multilevel Markov chain Monte\nCarlo method of [11], which works with a hierarchy of time discretizations, and\nshow empirically and theoretically that this is preferable to using a single\ntime discretization. The improvement is in terms of the computational cost\nneeded to obtain a pre-specified numerical error. Our approach is illustrated\non models that are found in neuroscience."}, "http://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation due to adverse events", "link": "http://arxiv.org/abs/2310.06653", "description": "In clinical trials, patients sometimes discontinue study treatments\nprematurely due to reasons such as adverse events. Treatment discontinuation\noccurs after randomisation as an intercurrent event, making causal\ninference more challenging. The Intention-To-Treat (ITT) analysis provides\nvalid causal estimates of the effect of treatment assignment; still, it does\nnot take into account whether or not patients had to discontinue the treatment\nprematurely. We propose to deal with the problem of treatment discontinuation\nusing principal stratification, recognised in the ICH E9(R1) addendum as a\nstrategy for handling intercurrent events. Under this approach, we can\ndecompose the overall ITT effect into principal causal effects for groups of\npatients defined by their potential discontinuation behaviour in continuous\ntime. In this framework, we must consider that discontinuation happening in\ncontinuous time generates an infinite number of principal strata and that\ndiscontinuation time is not defined for patients who would never discontinue.\nAn additional complication is that discontinuation time and time-to-event\noutcomes are subject to administrative censoring. We employ a flexible\nmodel-based Bayesian approach to deal with such complications. 
We apply the\nBayesian principal stratification framework to analyse synthetic data based on\na recent RCT in Oncology, aiming to assess the causal effects of a new\ninvestigational drug combined with standard of care vs. standard of care alone\non progression-free survival. We simulate data under different assumptions that\nreflect real situations where patients' behaviour depends on critical baseline\ncovariates. Finally, we highlight how such an approach makes it straightforward\nto characterise patients' discontinuation behaviour with respect to the\navailable covariates with the help of a simulation study."}, "http://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "http://arxiv.org/abs/2310.06673", "description": "An assurance calculation is a Bayesian alternative to a power calculation.\nOne may be performed to aid the planning of a clinical trial, specifically\nto set the sample size, or to support decisions about whether or not to perform\na study. Immuno-oncology (IO) is a rapidly evolving area in the development of\nanticancer drugs. A common phenomenon that arises from IO trials is one of\ndelayed treatment effects, that is, there is a delay in the separation of the\nsurvival curves. To calculate assurance for a trial in which a delayed\ntreatment effect is likely to be present, uncertainty about key parameters\nneeds to be considered. If uncertainty is not considered, then the number of\npatients recruited may not be enough to ensure we have adequate statistical\npower to detect a clinically relevant treatment effect. We present a new\nelicitation technique for when a delayed treatment effect is likely to be\npresent and show how to compute assurance using these elicited prior\ndistributions. We provide an example to illustrate how this could be used in\npractice. Open-source software is provided for implementing our methods. Our\nmethodology makes the benefits of assurance methods available for the planning\nof IO trials (and others where a delayed treatment effect is likely to occur)."}, "http://arxiv.org/abs/2310.06696": {"title": "Variable selection with FDR control for noisy data -- an application to screening metabolites that are associated with breast and colorectal cancer", "link": "http://arxiv.org/abs/2310.06696", "description": "The rapidly expanding field of metabolomics presents an invaluable resource\nfor understanding the associations between metabolites and various diseases.\nHowever, the high dimensionality, presence of missing values, and measurement\nerrors associated with metabolomics data can present challenges in developing\nreliable and reproducible methodologies for disease association studies.\nTherefore, there is a compelling need to develop robust statistical methods\nthat can navigate these complexities to achieve reliable and reproducible\ndisease association studies. In this paper, we focus on developing such a\nmethodology with an emphasis on controlling the False Discovery Rate during the\nscreening of mutual metabolomic signals for multiple disease outcomes. We\nillustrate the versatility and performance of this procedure in a variety of\nscenarios, dealing with missing data and measurement errors. As a specific\napplication of this novel methodology, we target two of the most prevalent\ncancers among US women: breast cancer and colorectal cancer. 
By applying our\nmethod to the Women's Health Initiative data, we successfully identify\nmetabolites that are associated with either or both of these cancers,\ndemonstrating the practical utility and potential of our method in identifying\nconsistent risk factors and understanding shared mechanisms between diseases."}, "http://arxiv.org/abs/2310.06708": {"title": "Adjustment with Three Continuous Variables", "link": "http://arxiv.org/abs/2310.06708", "description": "Spurious association between X and Y may be due to a confounding variable W.\nStatisticians may adjust for W using a variety of techniques. This paper\npresents the results of simulations conducted to assess the performance of\nthose techniques under various elementary data-generating processes. The\nresults indicate that no technique is best overall and that specific techniques\nshould be selected based on the particulars of the data-generating process.\nHere we show how causal graphs can guide the selection or design of techniques\nfor statistical adjustment. R programs are provided for researchers interested\nin generalization."}, "http://arxiv.org/abs/2310.06720": {"title": "Asymptotic theory for Bayesian inference and prediction: from the ordinary to a conditional Peaks-Over-Threshold method", "link": "http://arxiv.org/abs/2310.06720", "description": "The Peaks Over Threshold (POT) method is the most popular statistical method\nfor the analysis of univariate extremes. Even though there is a rich applied\nliterature on Bayesian inference for the POT method, there is no asymptotic\ntheory for such proposals. Even more importantly, the ambitious and challenging\nproblem of predicting future extreme events according to a proper probabilistic\nforecasting approach has received no attention to date. In this paper we\ndevelop the asymptotic theory (consistency, contraction rates, asymptotic\nnormality and asymptotic coverage of credible intervals) for the Bayesian\ninference based on the POT method. We extend such an asymptotic theory to cover\nthe Bayesian inference on the tail properties of the conditional distribution\nof a response random variable conditionally on a vector of random covariates.\nWith the aim of making accurate predictions of more severe extreme events than those\nthat occurred in the past, we specify the posterior predictive distribution of a\nfuture unobservable excess variable in the unconditional and conditional\napproach, and we prove that it is Wasserstein consistent and derive its contraction\nrates. Simulations show the good performance of the proposed Bayesian\ninferential methods. The analysis of the change in the frequency of financial\ncrises over time shows the utility of our methodology."}, "http://arxiv.org/abs/2310.06730": {"title": "Sparse topic modeling via spectral decomposition and thresholding", "link": "http://arxiv.org/abs/2310.06730", "description": "The probabilistic Latent Semantic Indexing model assumes that the expectation\nof the corpus matrix is low-rank and can be written as the product of a\ntopic-word matrix and a word-document matrix. In this paper, we study the\nestimation of the topic-word matrix under the additional assumption that the\nordered entries of its columns rapidly decay to zero. This sparsity assumption\nis motivated by the empirical observation that the word frequencies in a text\noften adhere to Zipf's law. 
We introduce a new spectral procedure for\nestimating the topic-word matrix that thresholds words based on their corpus\nfrequencies, and show that its $\\ell_1$-error rate under our sparsity\nassumption depends on the vocabulary size $p$ only via a logarithmic term. Our\nerror bound is valid for all parameter regimes and in particular for the\nsetting where $p$ is extremely large; this high-dimensional setting is commonly\nencountered but has not been adequately addressed in prior literature.\nFurthermore, our procedure also accommodates datasets that violate the\nseparability assumption, which is necessary for most prior approaches in topic\nmodeling. Experiments with synthetic data confirm that our procedure is\ncomputationally fast and allows for consistent estimation of the topic-word\nmatrix in a wide variety of parameter regimes. Our procedure also performs well\nrelative to well-established methods when applied to a large corpus of research\npaper abstracts, as well as the analysis of single-cell and microbiome data\nwhere the same statistical model is relevant but the parameter regimes are\nvastly different."}, "http://arxiv.org/abs/2310.06746": {"title": "Causal Rule Learning: Enhancing the Understanding of Heterogeneous Treatment Effect via Weighted Causal Rules", "link": "http://arxiv.org/abs/2310.06746", "description": "Interpretability is a key concern in estimating heterogeneous treatment\neffects using machine learning methods, especially for healthcare applications\nwhere high-stake decisions are often made. Inspired by the Predictive,\nDescriptive, Relevant framework of interpretability, we propose causal rule\nlearning which finds a refined set of causal rules characterizing potential\nsubgroups to estimate and enhance our understanding of heterogeneous treatment\neffects. Causal rule learning involves three phases: rule discovery, rule\nselection, and rule analysis. In the rule discovery phase, we utilize a causal\nforest to generate a pool of causal rules with corresponding subgroup average\ntreatment effects. The selection phase then employs a D-learning method to\nselect a subset of these rules to deconstruct individual-level treatment\neffects as a linear combination of the subgroup-level effects. This helps to\nanswer an ignored question by previous literature: what if an individual\nsimultaneously belongs to multiple groups with different average treatment\neffects? The rule analysis phase outlines a detailed procedure to further\nanalyze each rule in the subset from multiple perspectives, revealing the most\npromising rules for further validation. The rules themselves, their\ncorresponding subgroup treatment effects, and their weights in the linear\ncombination give us more insights into heterogeneous treatment effects.\nSimulation and real-world data analysis demonstrate the superior performance of\ncausal rule learning on the interpretable estimation of heterogeneous treatment\neffect when the ground truth is complex and the sample size is sufficient."}, "http://arxiv.org/abs/2310.06808": {"title": "Odds are the sign is right", "link": "http://arxiv.org/abs/2310.06808", "description": "This article introduces a new condition based on odds ratios for sensitivity\nanalysis. The analysis involves the average effect of a treatment or exposure\non a response or outcome with estimates adjusted for and conditional on a\nsingle, unmeasured, dichotomous covariate. 
Results of statistical simulations\nare displayed to show that the odds ratio condition is as reliable as other\ncommonly used conditions for sensitivity analysis. Other conditions utilize\nquantities reflective of a mediating covariate. The odds ratio condition can be\napplied when the covariate is a confounding variable. As an example application\nwe use the odds ratio condition to analyze and interpret a positive association\nobserved between Zika virus infection and birth defects."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2204.06030": {"title": "Variable importance measures for heterogeneous causal effects", "link": "http://arxiv.org/abs/2204.06030", "description": "The recognition that personalised treatment decisions lead to better clinical\noutcomes has sparked recent research activity in the following two domains.\nPolicy learning focuses on finding optimal treatment rules (OTRs), which\nexpress whether an individual would be better off with or without treatment,\ngiven their measured characteristics. OTRs optimize a pre-set population\ncriterion, but do not provide insight into the extent to which treatment\nbenefits or harms individual subjects. Estimates of conditional average\ntreatment effects (CATEs) do offer such insights, but valid inference is\ncurrently difficult to obtain when data-adaptive methods are used. Moreover,\nclinicians are (rightly) hesitant to blindly adopt OTR or CATE estimates, not\nleast since both may represent complicated functions of patient characteristics\nthat provide little insight into the key drivers of heterogeneity. To address\nthese limitations, we introduce novel nonparametric treatment effect variable\nimportance measures (TE-VIMs). TE-VIMs extend recent regression-VIMs, viewed as\nnonparametric analogues to ANOVA statistics. By not being tied to a particular\nmodel, they are amenable to data-adaptive (machine learning) estimation of the\nCATE, itself an active area of research. Estimators for the proposed statistics\nare derived from their efficient influence curves and these are illustrated\nthrough a simulation study and an applied example."}, "http://arxiv.org/abs/2204.07907": {"title": "Just Identified Indirect Inference Estimator: Accurate Inference through Bias Correction", "link": "http://arxiv.org/abs/2204.07907", "description": "An important challenge in statistical analysis lies in controlling the\nestimation bias when handling the ever-increasing data size and model\ncomplexity of modern data settings. 
In this paper, we propose a reliable\nestimation and inference approach for parametric models based on the Just\nIdentified iNdirect Inference estimator (JINI). The key advantage of our\napproach is that it allows us to construct a consistent estimator in a simple\nmanner, while providing strong bias correction guarantees that lead to accurate\ninference. Our approach is particularly useful for complex parametric models,\nas it allows us to bypass the analytical and computational difficulties (e.g., due\nto an intractable estimating equation) typically encountered in standard\nprocedures. The properties of JINI (including consistency, asymptotic\nnormality, and its bias correction property) are also studied when the\nparameter dimension is allowed to diverge, which provides the theoretical\nfoundation to explain the advantageous performance of JINI in settings with\nincreasing covariate dimension. Our simulations and an alcohol consumption\ndata analysis highlight the practical usefulness and excellent performance of\nJINI when data present features (e.g., misclassification, rounding), as well as\nin robust estimation."}, "http://arxiv.org/abs/2209.05598": {"title": "Learning domain-specific causal discovery from time series", "link": "http://arxiv.org/abs/2209.05598", "description": "Causal discovery (CD) from time-varying data is important in neuroscience,\nmedicine, and machine learning. Techniques for CD encompass randomized\nexperiments, which are generally unbiased but expensive, and algorithms such as\nGranger causality, conditional-independence-based, structural-equation-based,\nand score-based methods that are only accurate under strong assumptions made by\nhuman designers. However, as demonstrated in other areas of machine learning,\nhuman expertise is often not entirely accurate and tends to be outperformed in\ndomains with abundant data. In this study, we examine whether we can enhance\ndomain-specific causal discovery for time series using a data-driven approach.\nOur findings indicate that this procedure significantly outperforms\nhuman-designed, domain-agnostic causal discovery methods, such as Mutual\nInformation, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor,\nthe NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when\nfeasible, the causality field should consider a supervised approach in which\ndomain-specific CD procedures are learned from extensive datasets with known\ncausal relationships, rather than being designed by human specialists. Our\nfindings promise a new approach toward improving CD in neural and medical data\nand for the broader machine learning community."}, "http://arxiv.org/abs/2209.05795": {"title": "Joint modelling of the body and tail of bivariate data", "link": "http://arxiv.org/abs/2209.05795", "description": "In situations where both extreme and non-extreme data are of interest,\nmodelling the whole data set accurately is important. In a univariate\nframework, modelling the bulk and tail of a distribution has been extensively\nstudied before. However, when more than one variable is of concern, models that\naim specifically at capturing both regions correctly are scarce in the\nliterature. A dependence model that blends two copulas with different\ncharacteristics over the whole range of the data support is proposed. One\ncopula is tailored to the bulk and the other to the tail, with a dynamic\nweighting function employed to transition smoothly between them. 
Tail\ndependence properties are investigated numerically and simulation is used to\nconfirm that the blended model is sufficiently flexible to capture a wide\nvariety of structures. The model is applied to study the dependence between\ntemperature and ozone concentration at two sites in the UK and compared with a\nsingle copula fit. The proposed model provides a better, more flexible, fit to\nthe data, and is also capable of capturing complex dependence structures."}, "http://arxiv.org/abs/2212.14650": {"title": "Two-step estimators of high dimensional correlation matrices", "link": "http://arxiv.org/abs/2212.14650", "description": "We investigate block diagonal and hierarchical nested stochastic multivariate\nGaussian models by studying their sample cross-correlation matrix on high\ndimensions. By performing numerical simulations, we compare a filtered sample\ncross-correlation with the population cross-correlation matrices by using\nseveral rotationally invariant estimators (RIE) and hierarchical clustering\nestimators (HCE) under several loss functions. We show that at large but finite\nsample size, sample cross-correlation filtered by RIE estimators are often\noutperformed by HCE estimators for several of the loss functions. We also show\nthat for block models and for hierarchically nested block models the best\ndetermination of the filtered sample cross-correlation is achieved by\nintroducing two-step estimators combining state-of-the-art non-linear shrinkage\nmodels with hierarchical clustering estimators."}, "http://arxiv.org/abs/2302.02457": {"title": "Scalable inference in functional linear regression with streaming data", "link": "http://arxiv.org/abs/2302.02457", "description": "Traditional static functional data analysis is facing new challenges due to\nstreaming data, where data constantly flow in. A major challenge is that\nstoring such an ever-increasing amount of data in memory is nearly impossible.\nIn addition, existing inferential tools in online learning are mainly developed\nfor finite-dimensional problems, while inference methods for functional data\nare focused on the batch learning setting. In this paper, we tackle these\nissues by developing functional stochastic gradient descent algorithms and\nproposing an online bootstrap resampling procedure to systematically study the\ninference problem for functional linear regression. In particular, the proposed\nestimation and inference procedures use only one pass over the data; thus they\nare easy to implement and suitable to the situation where data arrive in a\nstreaming manner. Furthermore, we establish the convergence rate as well as the\nasymptotic distribution of the proposed estimator. Meanwhile, the proposed\nperturbed estimator from the bootstrap procedure is shown to enjoy the same\ntheoretical properties, which provide the theoretical justification for our\nonline inference tool. As far as we know, this is the first inference result on\nthe functional linear regression model with streaming data. Simulation studies\nare conducted to investigate the finite-sample performance of the proposed\nprocedure. 
An application is illustrated with the Beijing multi-site\nair-quality data."}, "http://arxiv.org/abs/2303.09598": {"title": "Variational Bayesian analysis of survival data using a log-logistic accelerated failure time model", "link": "http://arxiv.org/abs/2303.09598", "description": "The log-logistic regression model is one of the most commonly used\naccelerated failure time (AFT) models in survival analysis, for which\nstatistical inference methods are mainly established under the frequentist\nframework. Recently, Bayesian inference for log-logistic AFT models using\nMarkov chain Monte Carlo (MCMC) techniques has also been widely developed. In\nthis work, we develop an alternative approach to MCMC methods and infer the\nparameters of the log-logistic AFT model via a mean-field variational Bayes\n(VB) algorithm. A piecewise approximation technique is embedded in deriving the\nVB algorithm to achieve conjugacy. The proposed VB algorithm is evaluated and\ncompared with typical frequentist inferences and MCMC inference using simulated\ndata under various scenarios. A publicly available dataset is employed for\nillustration. We demonstrate that the proposed VB algorithm can achieve good\nestimation accuracy and has a lower computational cost compared with MCMC\nmethods."}, "http://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "http://arxiv.org/abs/2304.03853", "description": "StepMix is an open-source Python package for the pseudo-likelihood estimation\n(one-, two- and three-step approaches) of generalized finite mixture models\n(latent profile and latent class analysis) with external variables (covariates\nand distal outcomes). In many applications in social sciences, the main\nobjective is not only to cluster individuals into latent classes, but also to\nuse these classes to develop more complex statistical models. These models\ngenerally divide into a measurement model that relates the latent classes to\nobserved indicators, and a structural model that relates covariates and outcome\nvariables to the latent classes. The measurement and structural models can be\nestimated jointly using the so-called one-step approach or sequentially using\nstepwise methods, which present significant advantages for practitioners\nregarding the interpretability of the estimated latent classes. In addition to\nthe one-step approach, StepMix implements the most important stepwise\nestimation methods from the literature, including the bias-adjusted three-step\nmethods with Bolk-Croon-Hagenaars and maximum likelihood corrections and the\nmore recent two-step approach. These pseudo-likelihood estimators are presented\nin this paper under a unified framework as specific expectation-maximization\nsubroutines. To facilitate and promote their adoption among the data science\ncommunity, StepMix follows the object-oriented design of the scikit-learn\nlibrary and provides an additional R wrapper."}} \ No newline at end of file +{"http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. 
In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. 
In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. 
Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. 
Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. 
We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. 
As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method-in-test. Still,\nmuch of the causal inference literature tends to design over-restricted or\nmisspecified studies. In this paper, we elaborate on the problem of improper\nsimulation design for causal methods and compile a list of desiderata for an\neffective simulation framework. We then introduce partially randomized causal\nsimulation (PARCS), a simulation framework that meets those desiderata. PARCS\nsynthesizes data based on graphical causal models and a wide range of\nadjustable parameters. There is a legible mapping from the usual causal\nassumptions to the parameters; thus, users can identify and specify the subset\nof related parameters and randomize the remaining ones to generate a range of\ncomplying data-generating processes for their causal method. The result is a\nmore comprehensive and inclusive empirical investigation for causal claims.\nUsing PARCS, we reproduce and extend the simulation studies of two well-known\ncausal discovery and missing data analysis papers to emphasize the necessity of\na proper simulation design. Our results show that those papers would have\nimproved and extended the findings, had they used PARCS for simulation. The\nframework is also implemented as a Python package. 
By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. 
In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. 
By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameters estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to classical\nresults in fixed dimensional time series that the localization error of\nchange-point estimator is $O_{p}(1)$, exact recovery of true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the number\nand locations of the change-point estimator under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed network,\nwhich is frequently encountered in application domains but largely neglected in\nliterature. 
The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\nfills the void of lacking uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of confidence\ninterval, as well as the false discovery rate (FDR) control for multiple node\ncomparison. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. 
The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. 
Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}, "http://arxiv.org/abs/2310.04452": {"title": "Short text classification with machine learning in the social sciences: The case of climate change on Twitter", "link": "http://arxiv.org/abs/2310.04452", "description": "To analyse large numbers of texts, social science researchers are\nincreasingly confronting the challenge of text classification. When manual\nlabeling is not possible and researchers have to find automatized ways to\nclassify texts, computer science provides a useful toolbox of machine-learning\nmethods whose performance remains understudied in the social sciences. In this\narticle, we compare the performance of the most widely used text classifiers by\napplying them to a typical research scenario in social science research: a\nrelatively small labeled dataset with infrequent occurrence of categories of\ninterest, which is a part of a large unlabeled dataset. As an example case, we\nlook at Twitter communication regarding climate change, a topic of increasing\nscholarly interest in interdisciplinary social science research. Using a novel\ndataset including 5,750 tweets from various international organizations\nregarding the highly ambiguous concept of climate change, we evaluate the\nperformance of methods in automatically classifying tweets based on whether\nthey are about climate change or not. In this context, we highlight two main\nfindings. First, supervised machine-learning methods perform better than\nstate-of-the-art lexicons, in particular as class balance increases. Second,\ntraditional machine-learning methods, such as logistic regression and random\nforest, perform similarly to sophisticated deep-learning methods, whilst\nrequiring much less training time and computational resources. The results have\nimportant implications for the analysis of short texts in social science\nresearch."}, "http://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "http://arxiv.org/abs/2310.04563", "description": "During the COVID-19 pandemic, implementing in-person indoor instruction in a\nsafe manner was a high priority for universities nationwide. To support this\neffort at the University, we developed a mathematical model for estimating the\nrisk of SARS-CoV-2 transmission in university classrooms. This model was used\nto design a safe classroom environment at the University during the COVID-19\npandemic that supported the higher occupancy levels needed to match\npre-pandemic numbers of in-person courses, despite a limited number of large\nclassrooms. A retrospective analysis at the end of the semester confirmed the\nmodel's assessment that the proposed classroom configuration would be safe. 
Our\nframework is generalizable and was also used to support reopening decisions at\nStanford University. In addition, our methods are flexible; our modeling\nframework was repurposed to plan for large university events and gatherings. We\nfound that our approach and methods work in a wide range of indoor settings and\ncould be used to support reopening planning across various industries, from\nsecondary schools to movie theaters and restaurants."}, "http://arxiv.org/abs/2310.04578": {"title": "TNDDR: Efficient and doubly robust estimation of COVID-19 vaccine effectiveness under the test-negative design", "link": "http://arxiv.org/abs/2310.04578", "description": "While the test-negative design (TND), which is routinely used for monitoring\nseasonal flu vaccine effectiveness (VE), has recently become integral to\nCOVID-19 vaccine surveillance, it is susceptible to selection bias due to\noutcome-dependent sampling. Some studies have addressed the identifiability and\nestimation of causal parameters under the TND, but efficiency bounds for\nnonparametric estimators of the target parameter under the unconfoundedness\nassumption have not yet been investigated. We propose a one-step doubly robust\nand locally efficient estimator called TNDDR (TND doubly robust), which\nutilizes sample splitting and can incorporate machine learning techniques to\nestimate the nuisance functions. We derive the efficient influence function\n(EIF) for the marginal expectation of the outcome under a vaccination\nintervention, explore the von Mises expansion, and establish the conditions for\n$\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR.\nThe proposed TNDDR is supported by both theoretical and empirical\njustifications, and we apply it to estimate COVID-19 VE in an administrative\ndataset of community-dwelling older people (aged $\\geq 60$y) in the province of\nQu\\'ebec, Canada."}, "http://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "http://arxiv.org/abs/2310.04660", "description": "Many scientific questions in biomedical, environmental, and psychological\nresearch involve understanding the impact of multiple factors on outcomes.\nWhile randomized factorial experiments are ideal for this purpose,\nrandomization is infeasible in many empirical studies. Therefore, investigators\noften rely on observational data, where drawing reliable causal inferences for\nmultiple factors remains challenging. As the number of treatment combinations\ngrows exponentially with the number of factors, some treatment combinations can\nbe rare or even missing by chance in observed data, further complicating\nfactorial effects estimation. To address these challenges, we propose a novel\nweighting method tailored to observational studies with multiple factors. Our\napproach uses weighted observational data to emulate a randomized factorial\nexperiment, enabling simultaneous estimation of the effects of multiple factors\nand their interactions. Our investigations reveal a crucial nuance: achieving\nbalance among covariates, as in single-factor scenarios, is necessary but\ninsufficient for unbiasedly estimating factorial effects. Our findings suggest\nthat balancing the factors is also essential in multi-factor settings.\nMoreover, we extend our weighting method to handle missing treatment\ncombinations in observed data. 
Finally, we study the asymptotic behavior of the\nnew weighting estimators and propose a consistent variance estimator, providing\nreliable inferences on factorial effects in observational studies."}, "http://arxiv.org/abs/2310.04709": {"title": "Time-dependent mediators in survival analysis: Graphical representation of causal assumptions", "link": "http://arxiv.org/abs/2310.04709", "description": "We study time-dependent mediators in survival analysis using a treatment\nseparation approach due to Didelez [2019] and based on earlier work by Robins\nand Richardson [2011]. This approach avoids nested counterfactuals and\ncrossworld assumptions which are otherwise common in mediation analysis. The\ncausal model of treatment, mediators, covariates, confounders and outcome is\nrepresented by causal directed acyclic graphs (DAGs). However, the DAGs tend to\nbe very complex when we have measurements at a large number of time points. We\ntherefore suggest using so-called rolled graphs in which a node represents an\nentire coordinate process instead of a single random variable, leading us to\nfar simpler graphical representations. The rolled graphs are not necessarily\nacyclic; they can be analyzed by $\\delta$-separation which is the appropriate\ngraphical separation criterion in this class of graphs and analogous to\n$d$-separation. In particular, $\\delta$-separation is a graphical tool for\nevaluating if the conditions of the mediation analysis are met or if unmeasured\nconfounders influence the estimated effects. We also state a mediational\ng-formula. This is similar to the approach in Vansteelandt et al. [2019]\nalthough that paper has a different conceptual basis. Finally, we apply this\nframework to a statistical model based on a Cox model with an added treatment\neffect. Keywords: survival analysis; mediation; causal inference; graphical\nmodels; local independence graphs"}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows testing for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. 
We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.04919": {"title": "The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models", "link": "http://arxiv.org/abs/2310.04919", "description": "In modern scientific research, the objective is often to identify which\nvariables are associated with an outcome among a large class of potential\npredictors. This goal can be achieved by selecting variables in a manner that\ncontrols the false discovery rate (FDR), the proportion of irrelevant\npredictors among the selections. Knockoff filtering is a cutting-edge approach\nto variable selection that provides FDR control. Existing knockoff statistics\nfrequently employ linear models to assess relationships between features and\nthe response, but the linearity assumption is often violated in real-world\napplications. This may result in poor power to detect truly prognostic\nvariables. We introduce a knockoff statistic based on the conditional\nprediction function (CPF), which can pair with state-of-the-art machine\nlearning predictive models, such as deep neural networks. The CPF statistics can capture\nthe nonlinear relationships between predictors and outcomes while also\naccounting for correlation between features. We illustrate the capability of\nthe CPF statistics to provide superior power over common knockoff statistics\nwith continuous, categorical, and survival outcomes using repeated simulations.\nKnockoff filtering with the CPF statistics is demonstrated using (1) a\nresidential building dataset to select predictors for the actual sales prices\nand (2) the TCGA dataset to select genes that are correlated with disease\nstaging in lung cancer patients."}, "http://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "http://arxiv.org/abs/2310.04924", "description": "Markov chain Monte Carlo significance tests were first introduced by Besag\nand Clifford in [4]. These methods produce statistically valid p-values in\nproblems where sampling from the null hypotheses is intractable. We give an\noverview of the methods of Besag and Clifford and some recent developments. A\nrange of examples and applications are discussed."}, "http://arxiv.org/abs/2310.04934": {"title": "UBSea: A Unified Community Detection Framework", "link": "http://arxiv.org/abs/2310.04934", "description": "Detecting communities in networks and graphs is an important task across many\ndisciplines such as statistics, social science and engineering. There are\ngenerally three different kinds of mixing patterns for the case of two\ncommunities: assortative mixing, disassortative mixing and core-periphery\nstructure. Modularity optimization is a classical way for fitting network\nmodels with communities. However, it can only deal with assortative mixing and\ndisassortative mixing when the mixing pattern is known and fails to discover\nthe core-periphery structure. In this paper, we extend modularity in a\nstrategic way and propose a new framework based on Unified Bigroups Standardized\nEdge-count Analysis (UBSea). It can address all the formerly mentioned\ncommunity mixing structures. In addition, this new framework is able to\nautomatically choose the mixing type to fit the networks. Simulation studies\nshow that the new framework has superb performance in a wide range of settings\nunder the stochastic block model and the degree-corrected stochastic block\nmodel. 
We show that the new approach produces consistent estimate of the\ncommunities under a suitable signal-to-noise-ratio condition, for the case of a\nblock model with two communities, for both undirected and directed networks.\nThe new method is illustrated through applications to several real-world\ndatasets."}, "http://arxiv.org/abs/2310.05049": {"title": "On Estimation of Optimal Dynamic Treatment Regimes with Multiple Treatments for Survival Data-With Application to Colorectal Cancer Study", "link": "http://arxiv.org/abs/2310.05049", "description": "Dynamic treatment regimes (DTR) are sequential decision rules corresponding\nto several stages of intervention. Each rule maps patients' covariates to\noptional treatments. The optimal dynamic treatment regime is the one that\nmaximizes the mean outcome of interest if followed by the overall population.\nMotivated by a clinical study on advanced colorectal cancer with traditional\nChinese medicine, we propose a censored C-learning (CC-learning) method to\nestimate the dynamic treatment regime with multiple treatments using survival\ndata. To address the challenges of multiple stages with right censoring, we\nmodify the backward recursion algorithm in order to adapt to the flexible\nnumber and timing of treatments. For handling the problem of multiple\ntreatments, we propose a framework from the classification perspective by\ntransferring the problem of optimization with multiple treatment comparisons\ninto an example-dependent cost-sensitive classification problem. With\nclassification and regression tree (CART) as the classifier, the CC-learning\nmethod can produce an estimated optimal DTR with good interpretability. We\ntheoretically prove the optimality of our method and numerically evaluate its\nfinite sample performances through simulation. With the proposed method, we\nidentify the interpretable tree treatment regimes at each stage for the\nadvanced colorectal cancer treatment data from Xiyuan Hospital."}, "http://arxiv.org/abs/2310.05151": {"title": "Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions", "link": "http://arxiv.org/abs/2310.05151", "description": "In clinical trials of longitudinal continuous outcomes, reference based\nimputation (RBI) has commonly been applied to handle missing outcome data in\nsettings where the estimand incorporates the effects of intercurrent events,\ne.g. treatment discontinuation. RBI was originally developed in the multiple\nimputation framework, however recently conditional mean imputation (CMI)\ncombined with the jackknife estimator of the standard error was proposed as a\nway to obtain deterministic treatment effect estimates and correct frequentist\ninference. For both multiple and CMI, a mixed model for repeated measures\n(MMRM) is often used for the imputation model, but this can be computationally\nintensive to fit to multiple data sets (e.g. the jackknife samples) and lead to\nconvergence issues with complex MMRM models with many parameters. Therefore, a\nstep-wise approach based on sequential linear regression (SLR) of the outcomes\nat each visit was developed for the imputation model in the multiple imputation\nframework, but similar developments in the CMI framework are lacking. In this\narticle, we fill this gap in the literature by proposing a SLR approach to\nimplement RBI in the CMI framework, and justify its validity using theoretical\nresults and simulations. 
We also illustrate our proposal on a real data\napplication."}, "http://arxiv.org/abs/2310.05398": {"title": "Statistical Inference for Modulation Index in Phase-Amplitude Coupling", "link": "http://arxiv.org/abs/2310.05398", "description": "Phase-amplitude coupling is a phenomenon observed in several neurological\nprocesses, where the phase of one signal modulates the amplitude of another\nsignal with a distinct frequency. The modulation index (MI) is a common\ntechnique used to quantify this interaction by assessing the Kullback-Leibler\ndivergence between a uniform distribution and the empirical conditional\ndistribution of amplitudes with respect to the phases of the observed signals.\nThe uniform distribution is an ideal representation that is expected to appear\nunder the absence of coupling. However, it does not reflect the statistical\nproperties of coupling values caused by random chance. In this paper, we\npropose a statistical framework for evaluating the significance of an observed\nMI value based on a null hypothesis that a MI value can be entirely explained\nby chance. Significance is obtained by comparing the value with a reference\ndistribution derived under the null hypothesis of independence (i.e., no\ncoupling) between signals. We derived a closed-form distribution of this null\nmodel, resulting in a scaled beta distribution. To validate the efficacy of our\nproposed framework, we conducted comprehensive Monte Carlo simulations,\nassessing the significance of MI values under various experimental scenarios,\nincluding amplitude modulation, trains of spikes, and sequences of\nhigh-frequency oscillations. Furthermore, we corroborated the reliability of\nour model by comparing its statistical significance thresholds with reported\nvalues from other research studies conducted under different experimental\nsettings. Our method offers several advantages such as meta-analysis\nreliability, simplicity and computational efficiency, as it provides p-values\nand significance levels without resorting to generating surrogate data through\nsampling procedures."}, "http://arxiv.org/abs/2310.05526": {"title": "Projecting infinite time series graphs to finite marginal graphs using number theory", "link": "http://arxiv.org/abs/2310.05526", "description": "In recent years, a growing number of method and application works have\nadapted and applied the causal-graphical-model framework to time series data.\nMany of these works employ time-resolved causal graphs that extend infinitely\ninto the past and future and whose edges are repetitive in time, thereby\nreflecting the assumption of stationary causal relationships. However, most\nresults and algorithms from the causal-graphical-model framework are not\ndesigned for infinite graphs. In this work, we develop a method for projecting\ninfinite time series graphs with repetitive edges to marginal graphical models\non a finite time window. These finite marginal graphs provide the answers to\n$m$-separation queries with respect to the infinite graph, a task that was\npreviously unresolved. Moreover, we argue that these marginal graphs are useful\nfor causal discovery and causal effect estimation in time series, effectively\nenabling to apply results developed for finite graphs to the infinite graphs.\nThe projection procedure relies on finding common ancestors in the\nto-be-projected graph and is, by itself, not new. 
However, the projection\nprocedure has not yet been algorithmically implemented for time series graphs\nsince in these infinite graphs there can be infinite sets of paths that might\ngive rise to common ancestors. We solve the search over these possibly infinite\nsets of paths by an intriguing combination of path-finding techniques for\nfinite directed graphs and solution theory for linear Diophantine equations. By\nproviding an algorithm that carries out the projection, our paper makes an\nimportant step towards a theoretically-grounded and method-agnostic\ngeneralization of a range of causal inference methods and results to time\nseries."}, "http://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "http://arxiv.org/abs/2310.05539", "description": "In response to the unique challenge created by high-dimensional mediators in\nmediation analysis, this paper presents a novel procedure for testing the\nnullity of the mediation effect in the presence of high-dimensional mediators.\nThe procedure incorporates two distinct features. Firstly, the test remains\nvalid under all cases of the composite null hypothesis, including the\nchallenging scenario where both exposure-mediator and mediator-outcome\ncoefficients are zero. Secondly, it does not impose structural assumptions on\nthe exposure-mediator coefficients, thereby allowing for an arbitrarily strong\nexposure-mediator relationship. To the best of our knowledge, the proposed test\nis the first of its kind to provably possess these two features in\nhigh-dimensional mediation analysis. The validity and consistency of the\nproposed test are established, and its numerical performance is showcased\nthrough simulation studies. The application of the proposed test is\ndemonstrated by examining the mediation effect of DNA methylation between\nsmoking status and lung cancer development."}, "http://arxiv.org/abs/2310.05548": {"title": "Cokrig-and-Regress for Spatially Misaligned Environmental Data", "link": "http://arxiv.org/abs/2310.05548", "description": "Spatially misaligned data, where the response and covariates are observed at\ndifferent spatial locations, commonly arise in many environmental studies. Much\nof the statistical literature on handling spatially misaligned data has been\ndevoted to the case of a single covariate and a linear relationship between the\nresponse and this covariate. Motivated by spatially misaligned data collected\non air pollution and weather in China, we propose a cokrig-and-regress (CNR)\nmethod to estimate spatial regression models involving multiple covariates and\npotentially non-linear associations. The CNR estimator is constructed by\nreplacing the unobserved covariates (at the response locations) by their\ncokriging predictor derived from the observed but misaligned covariates under a\nmultivariate Gaussian assumption, where a generalized Kronecker product\ncovariance is used to account for spatial correlations within and between\ncovariates. A parametric bootstrap approach is employed to bias-correct the CNR\nestimates of the spatial covariance parameters and for uncertainty\nquantification. Simulation studies demonstrate that CNR outperforms several\nexisting methods for handling spatially misaligned data, such as\nnearest-neighbor interpolation. 
Applying CNR to the spatially misaligned air\npollution and weather data in China reveals a number of non-linear\nrelationships between PM$_{2.5}$ concentration and several meteorological\ncovariates."}, "http://arxiv.org/abs/2310.05622": {"title": "A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards", "link": "http://arxiv.org/abs/2310.05622", "description": "While well-established methods for time-to-event data are available when the\nproportional hazards assumption holds, there is no consensus on the best\ninferential approach under non-proportional hazards (NPH). However, a wide\nrange of parametric and non-parametric methods for testing and estimation in\nthis scenario have been proposed. To provide recommendations on the statistical\nanalysis of clinical trials where non proportional hazards are expected, we\nconducted a comprehensive simulation study under different scenarios of\nnon-proportional hazards, including delayed onset of treatment effect, crossing\nhazard curves, subgroups with different treatment effect and changing hazards\nafter disease progression. We assessed type I error rate control, power and\nconfidence interval coverage, where applicable, for a wide range of methods\nincluding weighted log-rank tests, the MaxCombo test, summary measures such as\nthe restricted mean survival time (RMST), average hazard ratios, and milestone\nsurvival probabilities as well as accelerated failure time regression models.\nWe found a trade-off between interpretability and power when choosing an\nanalysis strategy under NPH scenarios. While analysis methods based on weighted\nlogrank tests typically were favorable in terms of power, they do not provide\nan easily interpretable treatment effect estimate. Also, depending on the\nweight function, they test a narrow null hypothesis of equal hazard functions\nand rejection of this null hypothesis may not allow for a direct conclusion of\ntreatment benefit in terms of the survival function. In contrast,\nnon-parametric procedures based on well interpretable measures as the RMST\ndifference had lower power in most scenarios. Model based methods based on\nspecific survival distributions had larger power, however often gave biased\nestimates and lower than nominal confidence interval coverage."}, "http://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "http://arxiv.org/abs/2310.05646", "description": "We study transfer learning in the context of estimating piecewise-constant\nsignals when source data, which may be relevant but disparate, are available in\naddition to the target data. We initially investigate transfer learning\nestimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for\nunisource data scenarios and then generalise these estimators to accommodate\nmultisource data. To further reduce estimation errors, especially in scenarios\nwhere some sources significantly differ from the target, we introduce an\ninformative source selection algorithm. We then examine these estimators with\nmultisource selection and establish their minimax optimality under specific\nregularity conditions. It is worth emphasising that, unlike the prevalent\nnarrative in the transfer learning literature that the performance is enhanced\nthrough large source sample sizes, our approaches leverage higher observation\nfrequencies and accommodate diverse frequencies across multiple sources. 
Our\ntheoretical findings are empirically validated through extensive numerical\nexperiments, with the code available online; see\nhttps://github.com/chrisfanwang/transferlearning"}, "http://arxiv.org/abs/2310.05685": {"title": "Post-Selection Inference for Sparse Estimation", "link": "http://arxiv.org/abs/2310.05685", "description": "When the model is not known and parameter testing or interval estimation is\nconducted after model selection, it is necessary to consider selective\ninference. This paper discusses this issue in the context of sparse estimation.\nFirstly, we describe selective inference related to Lasso as per \cite{lee},\nand then present polyhedra and truncated distributions when applying it to\nmethods such as Forward Stepwise and LARS. Lastly, we discuss the Significance\nTest for Lasso by \cite{significant} and the Spacing Test for LARS by\n\cite{ryan_exact}. This paper serves as a review article.\n\nKeywords: post-selective inference, polyhedron, LARS, lasso, forward\nstepwise, significance test, spacing test."}, "http://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "http://arxiv.org/abs/2310.05921", "description": "We introduce Conformal Decision Theory, a framework for producing safe\nautonomous decisions despite imperfect machine learning predictions. Examples\nof such decisions are ubiquitous, from robot planning algorithms that rely on\npedestrian predictions, to calibrating autonomous manufacturing to exhibit high\nthroughput and low error, to the choice of trusting a nominal policy versus\nswitching to a safe backup policy at run-time. The decisions produced by our\nalgorithms are safe in the sense that they come with provable statistical\nguarantees of having low risk without any assumptions on the world model\nwhatsoever; the observations need not be I.I.D. and can even be adversarial.\nThe theory extends results from conformal prediction to calibrate decisions\ndirectly, without requiring the construction of prediction sets. Experiments\ndemonstrate the utility of our approach in robot motion planning around humans,\nautomated stock trading, and robot manufacturing."}, "http://arxiv.org/abs/2101.06950": {"title": "Learning and scoring Gaussian latent variable causal models with unknown additive interventions", "link": "http://arxiv.org/abs/2101.06950", "description": "With observational data alone, causal structure learning is a challenging\nproblem. The task becomes easier when having access to data collected from\nperturbations of the underlying system, even when the nature of these is\nunknown. Existing methods either do not allow for the presence of latent\nvariables or assume that these remain unperturbed. However, these assumptions\nare hard to justify if the nature of the perturbations is unknown. We provide\nresults that enable scoring causal structures in the setting with additive, but\nunknown interventions. Specifically, we propose a maximum-likelihood estimator\nin a structural equation model that exploits system-wide invariances to output\nan equivalence class of causal structures from perturbation data. Furthermore,\nunder certain structural assumptions on the population model, we provide a\nsimple graphical characterization of all the DAGs in the interventional\nequivalence class. 
We illustrate the utility of our framework on synthetic data\nas well as real data involving California reservoirs and protein expressions.\nThe software implementation is available as the Python package \\emph{utlvce}."}, "http://arxiv.org/abs/2107.14151": {"title": "Modern Non-Linear Function-on-Function Regression", "link": "http://arxiv.org/abs/2107.14151", "description": "We introduce a new class of non-linear function-on-function regression models\nfor functional data using neural networks. We propose a framework using a\nhidden layer consisting of continuous neurons, called a continuous hidden\nlayer, for functional response modeling and give two model fitting strategies,\nFunctional Direct Neural Network (FDNN) and Functional Basis Neural Network\n(FBNN). Both are designed explicitly to exploit the structure inherent in\nfunctional data and capture the complex relations existing between the\nfunctional predictors and the functional response. We fit these models by\nderiving functional gradients and implement regularization techniques for more\nparsimonious results. We demonstrate the power and flexibility of our proposed\nmethod in handling complex functional models through extensive simulation\nstudies as well as real data examples."}, "http://arxiv.org/abs/2112.00832": {"title": "On the mixed-model analysis of covariance in cluster-randomized trials", "link": "http://arxiv.org/abs/2112.00832", "description": "In the analyses of cluster-randomized trials, mixed-model analysis of\ncovariance (ANCOVA) is a standard approach for covariate adjustment and\nhandling within-cluster correlations. However, when the normality, linearity,\nor the random-intercept assumption is violated, the validity and efficiency of\nthe mixed-model ANCOVA estimators for estimating the average treatment effect\nremain unclear. Under the potential outcomes framework, we prove that the\nmixed-model ANCOVA estimators for the average treatment effect are consistent\nand asymptotically normal under arbitrary misspecification of its working\nmodel. If the probability of receiving treatment is 0.5 for each cluster, we\nfurther show that the model-based variance estimator under mixed-model ANCOVA1\n(ANCOVA without treatment-covariate interactions) remains consistent,\nclarifying that the confidence interval given by standard software is\nasymptotically valid even under model misspecification. Beyond robustness, we\ndiscuss several insights on precision among classical methods for analyzing\ncluster-randomized trials, including the mixed-model ANCOVA, individual-level\nANCOVA, and cluster-level ANCOVA estimators. These insights may inform the\nchoice of methods in practice. Our analytical results and insights are\nillustrated via simulation studies and analyses of three cluster-randomized\ntrials."}, "http://arxiv.org/abs/2201.10770": {"title": "Confidence intervals for the Cox model test error from cross-validation", "link": "http://arxiv.org/abs/2201.10770", "description": "Cross-validation (CV) is one of the most widely used techniques in\nstatistical learning for estimating the test error of a model, but its behavior\nis not yet fully understood. It has been shown that standard confidence\nintervals for test error using estimates from CV may have coverage below\nnominal levels. This phenomenon occurs because each sample is used in both the\ntraining and testing procedures during CV and as a result, the CV estimates of\nthe errors become correlated. 
Without accounting for this correlation, the\nestimate of the variance is smaller than it should be. One way to mitigate this\nissue is by estimating the mean squared error of the prediction error instead\nusing nested CV. This approach has been shown to achieve superior coverage\ncompared to intervals derived from standard CV. In this work, we generalize the\nnested CV idea to the Cox proportional hazards model and explore various\nchoices of test error for this setting."}, "http://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2202.08419", "description": "In this paper, we develop a novel high-dimensional time-varying coefficient\nestimation method, based on high-dimensional Ito diffusion processes. To\naccount for high-dimensional time-varying coefficients, we first estimate local\n(or instantaneous) coefficients using a time-localized Dantzig selection scheme\nunder a sparsity condition, which results in biased local coefficient\nestimators due to the regularization. To handle the bias, we propose a\ndebiasing scheme, which provides well-performing unbiased local coefficient\nestimators. With the unbiased local coefficient estimators, we estimate the\nintegrated coefficient, and to further account for the sparsity of the\ncoefficient process, we apply thresholding schemes. We call this Thresholding\ndEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED\nestimator. In the empirical analysis, we apply the TED procedure to analyzing\nhigh-dimensional factor models using high-frequency data."}, "http://arxiv.org/abs/2206.12525": {"title": "Causality of Functional Longitudinal Data", "link": "http://arxiv.org/abs/2206.12525", "description": "\"Treatment-confounder feedback\" is the central complication to resolve in\nlongitudinal studies, to infer causality. The existing frameworks for\nidentifying causal effects for longitudinal studies with discrete repeated\nmeasures hinge heavily on assuming that time advances in discrete time steps or\ntreatment changes as a jumping process, rendering the number of \"feedbacks\"\nfinite. However, medical studies nowadays with real-time monitoring involve\nfunctional time-varying outcomes, treatment, and confounders, which leads to an\nuncountably infinite number of feedbacks between treatment and confounders.\nTherefore more general and advanced theory is needed. We generalize the\ndefinition of causal effects under user-specified stochastic treatment regimes\nto longitudinal studies with continuous monitoring and develop an\nidentification framework, allowing right censoring and truncation by death. We\nprovide sufficient identification assumptions including a generalized\nconsistency assumption, a sequential randomization assumption, a positivity\nassumption, and a novel \"intervenable\" assumption designed for the\ncontinuous-time case. Under these assumptions, we propose a g-computation\nprocess and an inverse probability weighting process, which suggest a\ng-computation formula and an inverse probability weighting formula for\nidentification. For practical purposes, we also construct two classes of\npopulation estimating equations to identify these two processes, respectively,\nwhich further suggest a doubly robust identification formula with extra\nrobustness against process misspecification. 
We prove that our framework fully\ngeneralizes the existing frameworks and is nonparametric."}, "http://arxiv.org/abs/2209.08139": {"title": "Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2209.08139", "description": "Bayesian variable selection methods are powerful techniques for fitting and\ninferring on sparse high-dimensional linear regression models. However, many\nare computationally intensive or require restrictive prior distributions on\nmodel parameters. In this paper, we propose a computationally efficient and\npowerful Bayesian approach for sparse high-dimensional linear regression.\nMinimal prior assumptions on the parameters are required through the use of\nplug-in empirical Bayes estimates of hyperparameters. Efficient maximum a\nposteriori (MAP) estimation is completed through a Parameter-Expanded\nExpectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in\na robust, computationally efficient coordinate-wise optimization which -- when\nupdating the coefficient for a particular predictor -- adjusts for the impact\nof other predictor variables. The completion of the E-step uses an approach\nmotivated by the popular two-group approach to multiple testing. The result is\na PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse\nhigh-dimensional linear regression, which can be completed using one-at-a-time\nor all-at-once type optimization. We compare the empirical properties of PROBE\nto comparable approaches with numerous simulation studies and analyses of\ncancer cell drug responses. The proposed approach is implemented in the R\npackage probe."}, "http://arxiv.org/abs/2212.02709": {"title": "SURE-tuned Bridge Regression", "link": "http://arxiv.org/abs/2212.02709", "description": "Consider the $\ell_{\alpha}$-regularized linear regression, also termed\nBridge regression. For $\alpha\in (0,1)$, Bridge regression enjoys several\nstatistical properties of interest such as sparsity and near-unbiasedness of\nthe estimates (Fan and Li, 2001). However, the main difficulty lies in the\nnon-convex nature of the penalty for these values of $\alpha$, which makes an\noptimization procedure challenging and usually it is only possible to find a\nlocal optimum. To address this issue, Polson et al. (2013) took a sampling-based\nfully Bayesian approach to this problem, using the correspondence between\nthe Bridge penalty and a power exponential prior on the regression\ncoefficients. However, their sampling procedure relies on Markov chain Monte\nCarlo (MCMC) techniques, which are inherently sequential and not scalable to\nlarge problem dimensions. Cross validation approaches are similarly\ncomputation-intensive. To this end, our contribution is a novel\n\emph{non-iterative} method to fit a Bridge regression model. The main\ncontribution lies in an explicit formula for Stein's unbiased risk estimate for\nthe out-of-sample prediction risk of Bridge regression, which can then be\noptimized to select the desired tuning parameters, allowing us to completely\nbypass MCMC as well as computation-intensive cross validation approaches. Our\nprocedure yields results in a fraction of the computational time compared to\niterative schemes, without any appreciable loss in statistical performance. 
An\nR implementation is publicly available online at:\nhttps://github.com/loriaJ/Sure-tuned_BridgeRegression ."}, "http://arxiv.org/abs/2212.03122": {"title": "Robust convex biclustering with a tuning-free method", "link": "http://arxiv.org/abs/2212.03122", "description": "Biclustering is widely used in different kinds of fields including gene\ninformation analysis, text mining, and recommendation systems by effectively\ndiscovering the local correlation between samples and features. However, many\nbiclustering algorithms will collapse when facing heavy-tailed data. In this\npaper, we propose a robust version of the convex biclustering algorithm with Huber\nloss. Yet, the newly introduced robustification parameter brings an extra\nburden to selecting the optimal parameters. Therefore, we propose a tuning-free\nmethod for automatically selecting the optimal robustification parameter with\nhigh efficiency. The simulation study demonstrates the superior\nperformance of our proposed method over traditional biclustering methods when\nencountering heavy-tailed noise. A real-life biomedical application is also\npresented. The R package RcvxBiclustr is available at\nhttps://github.com/YifanChen3/RcvxBiclustr."}, "http://arxiv.org/abs/2301.09661": {"title": "Estimating marginal treatment effects from observational studies and indirect treatment comparisons: When are standardization-based methods preferable to those based on propensity score weighting?", "link": "http://arxiv.org/abs/2301.09661", "description": "In light of newly developed standardization methods, we evaluate, via\na simulation study, how propensity score weighting and standardization-based\napproaches compare for obtaining estimates of the marginal odds ratio and the\nmarginal hazard ratio. Specifically, we consider how the two approaches compare\nin two different scenarios: (1) in a single observational study, and (2) in an\nanchored indirect treatment comparison (ITC) of randomized controlled trials.\nWe present the material in such a way that the matching-adjusted indirect\ncomparison (MAIC) and the (novel) simulated treatment comparison (STC) methods\nin the ITC setting may be viewed as analogous to the propensity score weighting\nand standardization methods in the single observational study setting. Our\nresults suggest that current recommendations for conducting ITCs can be\nimproved and underscore the importance of adjusting for purely prognostic\nfactors."}, "http://arxiv.org/abs/2302.11746": {"title": "Logistic Regression and Classification with non-Euclidean Covariates", "link": "http://arxiv.org/abs/2302.11746", "description": "We introduce a logistic regression model for data pairs consisting of a\nbinary response and a covariate residing in a non-Euclidean metric space\nwithout vector structures. Based on the proposed model we also develop a binary\nclassifier for non-Euclidean objects. We propose a maximum likelihood estimator\nfor the non-Euclidean regression coefficient in the model, and provide upper\nbounds on the estimation error under various metric entropy conditions that\nquantify complexity of the underlying metric space. Matching lower bounds are\nderived for the important metric spaces commonly seen in statistics,\nestablishing optimality of the proposed estimator in such spaces. Similarly, an\nupper bound on the excess risk of the developed classifier is provided for\ngeneral metric spaces. 
A finer upper bound and a matching lower bound, and thus\noptimality of the proposed classifier, are established for Riemannian\nmanifolds. We investigate the numerical performance of the proposed estimator\nand classifier via simulation studies, and illustrate their practical merits\nvia an application to task-related fMRI data."}, "http://arxiv.org/abs/2302.13658": {"title": "Robust High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2302.13658", "description": "In this paper, we develop a novel high-dimensional coefficient estimation\nprocedure based on high-frequency data. Unlike usual high-dimensional\nregression procedure such as LASSO, we additionally handle the heavy-tailedness\nof high-frequency observations as well as time variations of coefficient\nprocesses. Specifically, we employ Huber loss and truncation scheme to handle\nheavy-tailed observations, while $\\ell_{1}$-regularization is adopted to\novercome the curse of dimensionality. To account for the time-varying\ncoefficient, we estimate local coefficients which are biased due to the\n$\\ell_{1}$-regularization. Thus, when estimating integrated coefficients, we\npropose a debiasing scheme to enjoy the law of large number property and employ\na thresholding scheme to further accommodate the sparsity of the coefficients.\nWe call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show\nthat the RED-LASSO estimator can achieve a near-optimal convergence rate. In\nthe empirical study, we apply the RED-LASSO procedure to the high-dimensional\nintegrated coefficient estimation using high-frequency trading data."}, "http://arxiv.org/abs/2307.04754": {"title": "Action-State Dependent Dynamic Model Selection", "link": "http://arxiv.org/abs/2307.04754", "description": "A model among many may only be best under certain states of the world.\nSwitching from a model to another can also be costly. Finding a procedure to\ndynamically choose a model in these circumstances requires to solve a complex\nestimation procedure and a dynamic programming problem. A Reinforcement\nlearning algorithm is used to approximate and estimate from the data the\noptimal solution to this dynamic programming problem. The algorithm is shown to\nconsistently estimate the optimal policy that may choose different models based\non a set of covariates. A typical example is the one of switching between\ndifferent portfolio models under rebalancing costs, using macroeconomic\ninformation. Using a set of macroeconomic variables and price data, an\nempirical application to the aforementioned portfolio problem shows superior\nperformance to choosing the best portfolio model with hindsight."}, "http://arxiv.org/abs/2307.14828": {"title": "Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley", "link": "http://arxiv.org/abs/2307.14828", "description": "Two-component mixture models have proved to be a powerful tool for modeling\nheterogeneity in several cluster analysis contexts. However, most methods based\non these models assume a constant behavior for the mixture weights, which can\nbe restrictive and unsuitable for some applications. In this paper, we relax\nthis assumption and allow the mixture weights to vary according to the index\n(e.g., time) to make the model more adaptive to a broader range of data sets.\nWe propose an efficient MCMC algorithm to jointly estimate both component\nparameters and dynamic weights from their posterior samples. 
We evaluate the\nmethod's performance by running Monte Carlo simulation studies under different\nscenarios for the dynamic weights. In addition, we apply the algorithm to a\ntime series that records the level reached by a river in southern Brazil. The\nTaquari River is a water body whose frequent flood inundations have caused\nvarious damage to riverside communities. Implementing a dynamic mixture model\nallows us to properly describe the flood regimes for the areas most affected by\nthese phenomena."}, "http://arxiv.org/abs/2310.06130": {"title": "Statistical inference for radially-stable generalized Pareto distributions and return level-sets in geometric extremes", "link": "http://arxiv.org/abs/2310.06130", "description": "We obtain a functional analogue of the quantile function for probability\nmeasures admitting a continuous Lebesgue density on $\\mathbb{R}^d$, and use it\nto characterize the class of non-trivial limit distributions of radially\nrecentered and rescaled multivariate exceedances in geometric extremes. A new\nclass of multivariate distributions is identified, termed radially stable\ngeneralized Pareto distributions, and is shown to admit certain stability\nproperties that permit extrapolation to extremal sets along any direction in\n$\\mathbb{R}^d$. Based on the limit Poisson point process likelihood of the\nradially renormalized point process of exceedances, we develop parsimonious\nstatistical models that exploit theoretical links between structural\nstar-bodies and are amenable to Bayesian inference. The star-bodies determine\nthe mean measure of the limit Poisson process through a hierarchical structure.\nOur framework sharpens statistical inference by suitably including additional\ninformation from the angular directions of the geometric exceedances and\nfacilitates efficient computations in dimensions $d=2$ and $d=3$. Additionally,\nit naturally leads to the notion of the return level-set, which is a canonical\nquantile set expressed in terms of its average recurrence interval, and a\ngeometric analogue of the uni-dimensional return level. We illustrate our\nmethods with a simulation study showing superior predictive performance of\nprobabilities of rare events, and with two case studies, one associated with\nriver flow extremes, and the other with oceanographic extremes."}, "http://arxiv.org/abs/2310.06242": {"title": "Treatment Choice, Mean Square Regret and Partial Identification", "link": "http://arxiv.org/abs/2310.06242", "description": "We consider a decision maker who faces a binary treatment choice when their\nwelfare is only partially identified from data. We contribute to the literature\nby anchoring our finite-sample analysis on mean square regret, a decision\ncriterion advocated by Kitagawa, Lee, and Qiu (2022). We find that optimal\nrules are always fractional, irrespective of the width of the identified set\nand precision of its estimate. The optimal treatment fraction is a simple\nlogistic transformation of the commonly used t-statistic multiplied by a factor\ncalculated by a simple constrained optimization. 
This treatment fraction gets\ncloser to 0.5 as the width of the identified set becomes wider, implying the\ndecision maker becomes more cautious against the adversarial Nature."}, "http://arxiv.org/abs/2310.06252": {"title": "Power and sample size calculation of two-sample projection-based testing for sparsely observed functional data", "link": "http://arxiv.org/abs/2310.06252", "description": "Projection-based testing for mean trajectory differences in two groups of\nirregularly and sparsely observed functional data has garnered significant\nattention in the literature because it accommodates a wide spectrum of group\ndifferences and (non-stationary) covariance structures. This article presents\nthe derivation of the theoretical power function and the introduction of a\ncomprehensive power and sample size (PASS) calculation toolkit tailored to the\nprojection-based testing method developed by Wang (2021). Our approach\naccommodates a wide spectrum of group difference scenarios and a broad class of\ncovariance structures governing the underlying processes. Through extensive\nnumerical simulation, we demonstrate the robustness of this testing method by\nshowcasing that its statistical power remains nearly unaffected even when a\ncertain percentage of observations are missing, rendering it 'missing-immune'.\nFurthermore, we illustrate the practical utility of this test through analysis\nof two randomized controlled trials of Parkinson's disease. To facilitate\nimplementation, we provide a user-friendly R package fPASS, complete with a\ndetailed vignette to guide users through its practical application. We\nanticipate that this article will significantly enhance the usability of this\npotent statistical tool across a range of biostatistical applications, with a\nparticular focus on its relevance in the design of clinical trials."}, "http://arxiv.org/abs/2310.06315": {"title": "Ultra-high dimensional confounder selection algorithms comparison with application to radiomics data", "link": "http://arxiv.org/abs/2310.06315", "description": "Radiomics is an emerging area of medical imaging data analysis particularly\nfor cancer. It involves the conversion of digital medical images into mineable\nultra-high dimensional data. Machine learning algorithms are widely used in\nradiomics data analysis to develop powerful decision support model to improve\nprecision in diagnosis, assessment of prognosis and prediction of therapy\nresponse. However, machine learning algorithms for causal inference have not\nbeen previously employed in radiomics analysis. In this paper, we evaluate the\nvalue of machine learning algorithms for causal inference in radiomics. We\nselect three recent competitive variable selection algorithms for causal\ninference: outcome-adaptive lasso (OAL), generalized outcome-adaptive lasso\n(GOAL) and causal ball screening (CBS). We used a sure independence screening\nprocedure to propose an extension of GOAL and OAL for ultra-high dimensional\ndata, SIS + GOAL and SIS + OAL. We compared SIS + GOAL, SIS + OAL and CBS using\nsimulation study and two radiomics datasets in cancer, osteosarcoma and\ngliosarcoma. 
The two radiomics studies and the simulation study identified SIS\n+ GOAL as the optimal variable selection algorithm."}, "http://arxiv.org/abs/2310.06330": {"title": "Multivariate moment least-squares estimators for reversible Markov chains", "link": "http://arxiv.org/abs/2310.06330", "description": "Markov chain Monte Carlo (MCMC) is a commonly used method for approximating\nexpectations with respect to probability distributions. Uncertainty assessment\nfor MCMC estimators is essential in practical applications. Moreover, for\nmultivariate functions of a Markov chain, it is important to estimate not only\nthe auto-correlation for each component but also to estimate\ncross-correlations, in order to better assess sample quality, improve estimates\nof effective sample size, and use more effective stopping rules. Berg and Song\n[2022] introduced the moment least squares (momentLS) estimator, a\nshape-constrained estimator for the autocovariance sequence from a reversible\nMarkov chain, for univariate functions of the Markov chain. Based on this\nsequence estimator, they proposed an estimator of the asymptotic variance of\nthe sample mean from MCMC samples. In this study, we propose novel\nautocovariance sequence and asymptotic variance estimators for Markov chain\nfunctions with multiple components, based on the univariate momentLS estimators\nfrom Berg and Song [2022]. We demonstrate strong consistency of the proposed\nauto(cross)-covariance sequence and asymptotic variance matrix estimators. We\nconduct empirical comparisons of our method with other state-of-the-art\napproaches on simulated and real-data examples, using popular samplers\nincluding the random-walk Metropolis sampler and the No-U-Turn sampler from\nSTAN."}, "http://arxiv.org/abs/2310.06357": {"title": "Adaptive Storey's null proportion estimator", "link": "http://arxiv.org/abs/2310.06357", "description": "False discovery rate (FDR) is a commonly used criterion in multiple testing\nand the Benjamini-Hochberg (BH) procedure is arguably the most popular approach\nwith FDR guarantee. To improve power, the adaptive BH procedure has been\nproposed by incorporating various null proportion estimators, among which\nStorey's estimator has gained substantial popularity. The performance of\nStorey's estimator hinges on a critical hyper-parameter, where a pre-fixed\nconfiguration lacks power and existing data-driven hyper-parameters compromise\nthe FDR control. In this work, we propose a novel class of adaptive\nhyper-parameters and establish the FDR control of the associated BH procedure\nusing a martingale argument. Within this class of data-driven hyper-parameters,\nwe present a specific configuration designed to maximize the number of\nrejections and characterize the convergence of this proposal to the optimal\nhyper-parameter under a commonly-used mixture model. We evaluate our adaptive\nStorey's null proportion estimator and the associated BH procedure on extensive\nsimulated data and a motivating protein dataset. Our proposal exhibits\nsignificant power gains when dealing with a considerable proportion of weak\nnon-nulls or a conservative null distribution."}, "http://arxiv.org/abs/2310.06467": {"title": "Advances in Kth nearest-neighbour clutter removal", "link": "http://arxiv.org/abs/2310.06467", "description": "We consider the problem of feature detection in the presence of clutter in\nspatial point processes. Classification methods have been developed in previous\nstudies. 
Among these, Byers and Raftery (1998) models the observed Kth nearest\nneighbour distances as a mixture distribution and classifies the clutter and\nfeature points consequently. In this paper, we enhance such approach in two\nmanners. First, we propose an automatic procedure for selecting the number of\nnearest neighbours to consider in the classification method by means of\nsegmented regression models. Secondly, with the aim of applying the procedure\nmultiple times to get a ``better\" end result, we propose a stopping criterion\nthat minimizes the overall entropy measure of cluster separation between\nclutter and feature points. The proposed procedures are suitable for a feature\nwith clutter as two superimposed Poisson processes on any space, including\nlinear networks. We present simulations and two case studies of environmental\ndata to illustrate the method."}, "http://arxiv.org/abs/2310.06533": {"title": "Multilevel Monte Carlo for a class of Partially Observed Processes in Neuroscience", "link": "http://arxiv.org/abs/2310.06533", "description": "In this paper we consider Bayesian parameter inference associated to a class\nof partially observed stochastic differential equations (SDE) driven by jump\nprocesses. Such type of models can be routinely found in applications, of which\nwe focus upon the case of neuroscience. The data are assumed to be observed\nregularly in time and driven by the SDE model with unknown parameters. In\npractice the SDE may not have an analytically tractable solution and this leads\nnaturally to a time-discretization. We adapt the multilevel Markov chain Monte\nCarlo method of [11], which works with a hierarchy of time discretizations and\nshow empirically and theoretically that this is preferable to using one single\ntime discretization. The improvement is in terms of the computational cost\nneeded to obtain a pre-specified numerical error. Our approach is illustrated\non models that are found in neuroscience."}, "http://arxiv.org/abs/2310.06653": {"title": "Evaluating causal effects on time-to-event outcomes in an RCT in Oncology with treatment discontinuation due to adverse events", "link": "http://arxiv.org/abs/2310.06653", "description": "In clinical trials, patients sometimes discontinue study treatments\nprematurely due to reasons such as adverse events. Treatment discontinuation\noccurs after the randomisation as an intercurrent event, making causal\ninference more challenging. The Intention-To-Treat (ITT) analysis provides\nvalid causal estimates of the effect of treatment assignment; still, it does\nnot take into account whether or not patients had to discontinue the treatment\nprematurely. We propose to deal with the problem of treatment discontinuation\nusing principal stratification, recognised in the ICH E9(R1) addendum as a\nstrategy for handling intercurrent events. Under this approach, we can\ndecompose the overall ITT effect into principal causal effects for groups of\npatients defined by their potential discontinuation behaviour in continuous\ntime. In this framework, we must consider that discontinuation happening in\ncontinuous time generates an infinite number of principal strata and that\ndiscontinuation time is not defined for patients who would never discontinue.\nAn additional complication is that discontinuation time and time-to-event\noutcomes are subject to administrative censoring. We employ a flexible\nmodel-based Bayesian approach to deal with such complications. 
We apply the\nBayesian principal stratification framework to analyse synthetic data based on\na recent RCT in Oncology, aiming to assess the causal effects of a new\ninvestigational drug combined with standard of care vs. standard of care alone\non progression-free survival. We simulate data under different assumptions that\nreflect real situations where patients' behaviour depends on critical baseline\ncovariates. Finally, we highlight how such an approach makes it straightforward\nto characterise patients' discontinuation behaviour with respect to the\navailable covariates with the help of a simulation study."}, "http://arxiv.org/abs/2310.06673": {"title": "Assurance Methods for designing a clinical trial with a delayed treatment effect", "link": "http://arxiv.org/abs/2310.06673", "description": "An assurance calculation is a Bayesian alternative to a power calculation.\nOne may be performed to aid the planning of a clinical trial, specifically to\nset the sample size, or to support decisions about whether or not to perform\na study. Immuno-oncology (IO) is a rapidly evolving area in the development of\nanticancer drugs. A common phenomenon that arises from IO trials is one of\ndelayed treatment effects, that is, there is a delay in the separation of the\nsurvival curves. To calculate assurance for a trial in which a delayed\ntreatment effect is likely to be present, uncertainty about key parameters\nneeds to be considered. If uncertainty is not considered, then the number of\npatients recruited may not be enough to ensure we have adequate statistical\npower to detect a clinically relevant treatment effect. We present a new\nelicitation technique for when a delayed treatment effect is likely to be\npresent and show how to compute assurance using these elicited prior\ndistributions. We provide an example to illustrate how this could be used in\npractice. Open-source software is provided for implementing our methods. Our\nmethodology makes the benefits of assurance methods available for the planning\nof IO trials (and others where a delayed treatment effect is likely to occur)."}, "http://arxiv.org/abs/2310.06696": {"title": "Variable selection with FDR control for noisy data -- an application to screening metabolites that are associated with breast and colorectal cancer", "link": "http://arxiv.org/abs/2310.06696", "description": "The rapidly expanding field of metabolomics presents an invaluable resource\nfor understanding the associations between metabolites and various diseases.\nHowever, the high dimensionality, presence of missing values, and measurement\nerrors associated with metabolomics data can present challenges in developing\nreliable and reproducible methodologies for disease association studies.\nTherefore, there is a compelling need to develop robust statistical methods\nthat can navigate these complexities to achieve reliable and reproducible\ndisease association studies. In this paper, we focus on developing such a\nmethodology with an emphasis on controlling the False Discovery Rate during the\nscreening of mutual metabolomic signals for multiple disease outcomes. We\nillustrate the versatility and performance of this procedure in a variety of\nscenarios, dealing with missing data and measurement errors. As a specific\napplication of this novel methodology, we target two of the most prevalent\ncancers among US women: breast cancer and colorectal cancer. 
By applying our\nmethod to the Women's Health Initiative data, we successfully identify\nmetabolites that are associated with either or both of these cancers,\ndemonstrating the practical utility and potential of our method in identifying\nconsistent risk factors and understanding shared mechanisms between diseases."}, "http://arxiv.org/abs/2310.06708": {"title": "Adjustment with Three Continuous Variables", "link": "http://arxiv.org/abs/2310.06708", "description": "Spurious association between X and Y may be due to a confounding variable W.\nStatisticians may adjust for W using a variety of techniques. This paper\npresents the results of simulations conducted to assess the performance of\nthose techniques under various elementary data-generating processes. The\nresults indicate that no technique is best overall and that specific techniques\nshould be selected based on the particulars of the data-generating process.\nHere we show how causal graphs can guide the selection or design of techniques\nfor statistical adjustment. R programs are provided for researchers interested\nin generalization."}, "http://arxiv.org/abs/2310.06720": {"title": "Asymptotic theory for Bayesian inference and prediction: from the ordinary to a conditional Peaks-Over-Threshold method", "link": "http://arxiv.org/abs/2310.06720", "description": "The Peaks Over Threshold (POT) method is the most popular statistical method\nfor the analysis of univariate extremes. Even though there is a rich applied\nliterature on Bayesian inference for the POT method, there is no asymptotic\ntheory for such proposals. Even more importantly, the ambitious and challenging\nproblem of predicting future extreme events according to a proper probabilistic\nforecasting approach has received no attention to date. In this paper we\ndevelop the asymptotic theory (consistency, contraction rates, asymptotic\nnormality and asymptotic coverage of credible intervals) for the Bayesian\ninference based on the POT method. We extend such an asymptotic theory to cover\nthe Bayesian inference on the tail properties of the conditional distribution\nof a response random variable conditionally to a vector of random covariates.\nWith the aim of making accurate predictions of extreme events more severe than\nthose that occurred in the past, we specify the posterior predictive distribution of a\nfuture unobservable excess variable in the unconditional and conditional\napproaches and we prove that it is Wasserstein consistent and derive its contraction\nrates. Simulations show the good performance of the proposed Bayesian\ninferential methods. The analysis of the change in the frequency of financial\ncrises over time shows the utility of our methodology."}, "http://arxiv.org/abs/2310.06730": {"title": "Sparse topic modeling via spectral decomposition and thresholding", "link": "http://arxiv.org/abs/2310.06730", "description": "The probabilistic Latent Semantic Indexing model assumes that the expectation\nof the corpus matrix is low-rank and can be written as the product of a\ntopic-word matrix and a word-document matrix. In this paper, we study the\nestimation of the topic-word matrix under the additional assumption that the\nordered entries of its columns rapidly decay to zero. This sparsity assumption\nis motivated by the empirical observation that the word frequencies in a text\noften adhere to Zipf's law. 
We introduce a new spectral procedure for\nestimating the topic-word matrix that thresholds words based on their corpus\nfrequencies, and show that its $\\ell_1$-error rate under our sparsity\nassumption depends on the vocabulary size $p$ only via a logarithmic term. Our\nerror bound is valid for all parameter regimes and in particular for the\nsetting where $p$ is extremely large; this high-dimensional setting is commonly\nencountered but has not been adequately addressed in prior literature.\nFurthermore, our procedure also accommodates datasets that violate the\nseparability assumption, which is necessary for most prior approaches in topic\nmodeling. Experiments with synthetic data confirm that our procedure is\ncomputationally fast and allows for consistent estimation of the topic-word\nmatrix in a wide variety of parameter regimes. Our procedure also performs well\nrelative to well-established methods when applied to a large corpus of research\npaper abstracts, as well as the analysis of single-cell and microbiome data\nwhere the same statistical model is relevant but the parameter regimes are\nvastly different."}, "http://arxiv.org/abs/2310.06746": {"title": "Causal Rule Learning: Enhancing the Understanding of Heterogeneous Treatment Effect via Weighted Causal Rules", "link": "http://arxiv.org/abs/2310.06746", "description": "Interpretability is a key concern in estimating heterogeneous treatment\neffects using machine learning methods, especially for healthcare applications\nwhere high-stake decisions are often made. Inspired by the Predictive,\nDescriptive, Relevant framework of interpretability, we propose causal rule\nlearning which finds a refined set of causal rules characterizing potential\nsubgroups to estimate and enhance our understanding of heterogeneous treatment\neffects. Causal rule learning involves three phases: rule discovery, rule\nselection, and rule analysis. In the rule discovery phase, we utilize a causal\nforest to generate a pool of causal rules with corresponding subgroup average\ntreatment effects. The selection phase then employs a D-learning method to\nselect a subset of these rules to deconstruct individual-level treatment\neffects as a linear combination of the subgroup-level effects. This helps to\nanswer an ignored question by previous literature: what if an individual\nsimultaneously belongs to multiple groups with different average treatment\neffects? The rule analysis phase outlines a detailed procedure to further\nanalyze each rule in the subset from multiple perspectives, revealing the most\npromising rules for further validation. The rules themselves, their\ncorresponding subgroup treatment effects, and their weights in the linear\ncombination give us more insights into heterogeneous treatment effects.\nSimulation and real-world data analysis demonstrate the superior performance of\ncausal rule learning on the interpretable estimation of heterogeneous treatment\neffect when the ground truth is complex and the sample size is sufficient."}, "http://arxiv.org/abs/2310.06808": {"title": "Odds are the sign is right", "link": "http://arxiv.org/abs/2310.06808", "description": "This article introduces a new condition based on odds ratios for sensitivity\nanalysis. The analysis involves the average effect of a treatment or exposure\non a response or outcome with estimates adjusted for and conditional on a\nsingle, unmeasured, dichotomous covariate. 
Results of statistical simulations\nare displayed to show that the odds ratio condition is as reliable as other\ncommonly used conditions for sensitivity analysis. Other conditions utilize\nquantities reflective of a mediating covariate. The odds ratio condition can be\napplied when the covariate is a confounding variable. As an example application\nwe use the odds ratio condition to analyze and interpret a positive association\nobserved between Zika virus infection and birth defects."}, "http://arxiv.org/abs/2009.07551": {"title": "Manipulation-Robust Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2009.07551", "description": "We present a new identification condition for regression discontinuity\ndesigns. We replace the local randomization of Lee (2008) with two restrictions\non its threat, namely, the manipulation of the running variable. Furthermore,\nwe provide the first auxiliary assumption of McCrary's (2008) diagnostic test\nto detect manipulation. Based on our auxiliary assumption, we derive a novel\nexpression of moments that immediately implies the worst-case bounds of Gerard,\nRokkanen, and Rothe (2020) and an enhanced interpretation of their target\nparameters. We highlight two issues: an overlooked source of identification\nfailure, and a missing auxiliary assumption to detect manipulation. In the case\nstudies, we illustrate our solution to these issues using institutional details\nand economic theories."}, "http://arxiv.org/abs/2204.06030": {"title": "Variable importance measures for heterogeneous causal effects", "link": "http://arxiv.org/abs/2204.06030", "description": "The recognition that personalised treatment decisions lead to better clinical\noutcomes has sparked recent research activity in the following two domains.\nPolicy learning focuses on finding optimal treatment rules (OTRs), which\nexpress whether an individual would be better off with or without treatment,\ngiven their measured characteristics. OTRs optimize a pre-set population\ncriterion, but do not provide insight into the extent to which treatment\nbenefits or harms individual subjects. Estimates of conditional average\ntreatment effects (CATEs) do offer such insights, but valid inference is\ncurrently difficult to obtain when data-adaptive methods are used. Moreover,\nclinicians are (rightly) hesitant to blindly adopt OTR or CATE estimates, not\nleast since both may represent complicated functions of patient characteristics\nthat provide little insight into the key drivers of heterogeneity. To address\nthese limitations, we introduce novel nonparametric treatment effect variable\nimportance measures (TE-VIMs). TE-VIMs extend recent regression-VIMs, viewed as\nnonparametric analogues to ANOVA statistics. By not being tied to a particular\nmodel, they are amenable to data-adaptive (machine learning) estimation of the\nCATE, itself an active area of research. Estimators for the proposed statistics\nare derived from their efficient influence curves and these are illustrated\nthrough a simulation study and an applied example."}, "http://arxiv.org/abs/2204.07907": {"title": "Just Identified Indirect Inference Estimator: Accurate Inference through Bias Correction", "link": "http://arxiv.org/abs/2204.07907", "description": "An important challenge in statistical analysis lies in controlling the\nestimation bias when handling the ever-increasing data size and model\ncomplexity of modern data settings. 
In this paper, we propose a reliable\nestimation and inference approach for parametric models based on the Just\nIdentified iNdirect Inference estimator (JINI). The key advantage of our\napproach is that it allows us to construct a consistent estimator in a simple\nmanner, while providing strong bias correction guarantees that lead to accurate\ninference. Our approach is particularly useful for complex parametric models,\nas it allows us to bypass the analytical and computational difficulties (e.g., due\nto an intractable estimating equation) typically encountered in standard\nprocedures. The properties of JINI (including consistency, asymptotic\nnormality, and its bias correction property) are also studied when the\nparameter dimension is allowed to diverge, which provides the theoretical\nfoundation to explain the advantageous performance of JINI in settings with a\ngrowing number of covariates. Our simulations and an alcohol consumption\ndata analysis highlight the practical usefulness and excellent performance of\nJINI when data present features (e.g., misclassification, rounding) as well as\nin robust estimation."}, "http://arxiv.org/abs/2209.05598": {"title": "Learning domain-specific causal discovery from time series", "link": "http://arxiv.org/abs/2209.05598", "description": "Causal discovery (CD) from time-varying data is important in neuroscience,\nmedicine, and machine learning. Techniques for CD encompass randomized\nexperiments, which are generally unbiased but expensive, and algorithms such as\nGranger causality, conditional-independence-based, structural-equation-based,\nand score-based methods that are only accurate under strong assumptions made by\nhuman designers. However, as demonstrated in other areas of machine learning,\nhuman expertise is often not entirely accurate and tends to be outperformed in\ndomains with abundant data. In this study, we examine whether we can enhance\ndomain-specific causal discovery for time series using a data-driven approach.\nOur findings indicate that this procedure significantly outperforms\nhuman-designed, domain-agnostic causal discovery methods, such as Mutual\nInformation, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor,\nthe NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when\nfeasible, the causality field should consider a supervised approach in which\ndomain-specific CD procedures are learned from extensive datasets with known\ncausal relationships, rather than being designed by human specialists. Our\nfindings promise a new approach toward improving CD in neural and medical data\nand for the broader machine learning community."}, "http://arxiv.org/abs/2209.05795": {"title": "Joint modelling of the body and tail of bivariate data", "link": "http://arxiv.org/abs/2209.05795", "description": "In situations where both extreme and non-extreme data are of interest,\nmodelling the whole data set accurately is important. In a univariate\nframework, modelling the bulk and tail of a distribution has been extensively\nstudied before. However, when more than one variable is of concern, models that\naim specifically at capturing both regions correctly are scarce in the\nliterature. A dependence model that blends two copulas with different\ncharacteristics over the whole range of the data support is proposed. One\ncopula is tailored to the bulk and the other to the tail, with a dynamic\nweighting function employed to transition smoothly between them. 
Tail\ndependence properties are investigated numerically and simulation is used to\nconfirm that the blended model is sufficiently flexible to capture a wide\nvariety of structures. The model is applied to study the dependence between\ntemperature and ozone concentration at two sites in the UK and compared with a\nsingle copula fit. The proposed model provides a better, more flexible, fit to\nthe data, and is also capable of capturing complex dependence structures."}, "http://arxiv.org/abs/2212.14650": {"title": "Two-step estimators of high dimensional correlation matrices", "link": "http://arxiv.org/abs/2212.14650", "description": "We investigate block diagonal and hierarchical nested stochastic multivariate\nGaussian models by studying their sample cross-correlation matrix on high\ndimensions. By performing numerical simulations, we compare a filtered sample\ncross-correlation with the population cross-correlation matrices by using\nseveral rotationally invariant estimators (RIE) and hierarchical clustering\nestimators (HCE) under several loss functions. We show that at large but finite\nsample size, sample cross-correlation filtered by RIE estimators are often\noutperformed by HCE estimators for several of the loss functions. We also show\nthat for block models and for hierarchically nested block models the best\ndetermination of the filtered sample cross-correlation is achieved by\nintroducing two-step estimators combining state-of-the-art non-linear shrinkage\nmodels with hierarchical clustering estimators."}, "http://arxiv.org/abs/2302.02457": {"title": "Scalable inference in functional linear regression with streaming data", "link": "http://arxiv.org/abs/2302.02457", "description": "Traditional static functional data analysis is facing new challenges due to\nstreaming data, where data constantly flow in. A major challenge is that\nstoring such an ever-increasing amount of data in memory is nearly impossible.\nIn addition, existing inferential tools in online learning are mainly developed\nfor finite-dimensional problems, while inference methods for functional data\nare focused on the batch learning setting. In this paper, we tackle these\nissues by developing functional stochastic gradient descent algorithms and\nproposing an online bootstrap resampling procedure to systematically study the\ninference problem for functional linear regression. In particular, the proposed\nestimation and inference procedures use only one pass over the data; thus they\nare easy to implement and suitable to the situation where data arrive in a\nstreaming manner. Furthermore, we establish the convergence rate as well as the\nasymptotic distribution of the proposed estimator. Meanwhile, the proposed\nperturbed estimator from the bootstrap procedure is shown to enjoy the same\ntheoretical properties, which provide the theoretical justification for our\nonline inference tool. As far as we know, this is the first inference result on\nthe functional linear regression model with streaming data. Simulation studies\nare conducted to investigate the finite-sample performance of the proposed\nprocedure. 
An application is illustrated with the Beijing multi-site\nair-quality data."}, "http://arxiv.org/abs/2303.09598": {"title": "Variational Bayesian analysis of survival data using a log-logistic accelerated failure time model", "link": "http://arxiv.org/abs/2303.09598", "description": "The log-logistic regression model is one of the most commonly used\naccelerated failure time (AFT) models in survival analysis, for which\nstatistical inference methods are mainly established under the frequentist\nframework. Recently, Bayesian inference for log-logistic AFT models using\nMarkov chain Monte Carlo (MCMC) techniques has also been widely developed. In\nthis work, we develop an alternative approach to MCMC methods and infer the\nparameters of the log-logistic AFT model via a mean-field variational Bayes\n(VB) algorithm. A piecewise approximation technique is embedded in deriving the\nVB algorithm to achieve conjugacy. The proposed VB algorithm is evaluated and\ncompared with typical frequentist inferences and MCMC inference using simulated\ndata under various scenarios. A publicly available dataset is employed for\nillustration. We demonstrate that the proposed VB algorithm can achieve good\nestimation accuracy and has a lower computational cost compared with MCMC\nmethods."}, "http://arxiv.org/abs/2304.03853": {"title": "StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables", "link": "http://arxiv.org/abs/2304.03853", "description": "StepMix is an open-source Python package for the pseudo-likelihood estimation\n(one-, two- and three-step approaches) of generalized finite mixture models\n(latent profile and latent class analysis) with external variables (covariates\nand distal outcomes). In many applications in social sciences, the main\nobjective is not only to cluster individuals into latent classes, but also to\nuse these classes to develop more complex statistical models. These models\ngenerally divide into a measurement model that relates the latent classes to\nobserved indicators, and a structural model that relates covariates and outcome\nvariables to the latent classes. The measurement and structural models can be\nestimated jointly using the so-called one-step approach or sequentially using\nstepwise methods, which present significant advantages for practitioners\nregarding the interpretability of the estimated latent classes. In addition to\nthe one-step approach, StepMix implements the most important stepwise\nestimation methods from the literature, including the bias-adjusted three-step\nmethods with Bolk-Croon-Hagenaars and maximum likelihood corrections and the\nmore recent two-step approach. These pseudo-likelihood estimators are presented\nin this paper under a unified framework as specific expectation-maximization\nsubroutines. To facilitate and promote their adoption among the data science\ncommunity, StepMix follows the object-oriented design of the scikit-learn\nlibrary and provides an additional R wrapper."}, "http://arxiv.org/abs/2310.06926": {"title": "Bayesian inference and cure rate modeling for event history data", "link": "http://arxiv.org/abs/2310.06926", "description": "Estimating model parameters of a general family of cure models is always a\nchallenging task mainly due to flatness and multimodality of the likelihood\nfunction. In this work, we propose a fully Bayesian approach in order to\novercome these issues. 
Posterior inference is carried out by constructing a\nMetropolis-coupled Markov chain Monte Carlo (MCMC) sampler, which combines\nGibbs sampling for the latent cure indicators and Metropolis-Hastings steps\nwith Langevin diffusion dynamics for parameter updates. The main MCMC algorithm\nis embedded within a parallel tempering scheme by considering heated versions\nof the target posterior distribution. It is demonstrated via simulations that\nthe proposed algorithm freely explores the multimodal posterior distribution\nand produces robust point estimates, while it outperforms maximum likelihood\nestimation via the Expectation-Maximization algorithm. A by-product of our\nBayesian implementation is to control the False Discovery Rate when classifying\nitems as cured or not. Finally, the proposed method is illustrated in a real\ndataset which refers to recidivism for offenders released from prison; the\nevent of interest is whether the offender was re-incarcerated after probation\nor not."}, "http://arxiv.org/abs/2310.06969": {"title": "Positivity-free Policy Learning with Observational Data", "link": "http://arxiv.org/abs/2310.06969", "description": "Policy learning utilizing observational data is pivotal across various\ndomains, with the objective of learning the optimal treatment assignment policy\nwhile adhering to specific constraints such as fairness, budget, and\nsimplicity. This study introduces a novel positivity-free (stochastic) policy\nlearning framework designed to address the challenges posed by the\nimpracticality of the positivity assumption in real-world scenarios. This\nframework leverages incremental propensity score policies to adjust propensity\nscore values instead of assigning fixed values to treatments. We characterize\nthese incremental propensity score policies and establish identification\nconditions, employing semiparametric efficiency theory to propose efficient\nestimators capable of achieving rapid convergence rates, even when integrated\nwith advanced machine learning algorithms. This paper provides a thorough\nexploration of the theoretical guarantees associated with policy learning and\nvalidates the proposed framework's finite-sample performance through\ncomprehensive numerical experiments, ensuring the identification of causal\neffects from observational data is both robust and reliable."}, "http://arxiv.org/abs/2310.07002": {"title": "Bayesian cross-validation by parallel Markov Chain Monte Carlo", "link": "http://arxiv.org/abs/2310.07002", "description": "Brute force cross-validation (CV) is a method for predictive assessment and\nmodel selection that is general and applicable to a wide range of Bayesian\nmodels. However, in many cases brute force CV is too computationally burdensome\nto form part of interactive modeling workflows, especially when inference\nrelies on Markov chain Monte Carlo (MCMC). In this paper we present a method\nfor conducting fast Bayesian CV by massively parallel MCMC. On suitable\naccelerator hardware, for many applications our approach is about as fast (in\nwall clock time) as a single full-data model fit.\n\nParallel CV is more flexible than existing fast CV approximation methods\nbecause it can easily exploit a wide range of scoring rules and data\npartitioning schemes. This is particularly useful for CV methods designed for\nnon-exchangeable data. Our approach also delivers accurate estimates of Monte\nCarlo and CV uncertainty. 
In addition to parallelizing computations, parallel\nCV speeds up inference by reusing information from earlier MCMC adaptation and\ninference obtained during initial model fitting and checking of the full-data\nmodel.\n\nWe propose MCMC diagnostics for parallel CV applications, including a summary\nof MCMC mixing based on the popular potential scale reduction factor\n($\\hat{R}$) and MCMC effective sample size ($\\widehat{ESS}$) measures.\nFurthermore, we describe a method for determining whether an $\\hat{R}$\ndiagnostic indicates approximate stationarity of the chains, that may be of\nmore general interest for applications beyond parallel CV.\n\nFor parallel CV to work on memory-constrained computing accelerators, we show\nthat parallel CV and associated diagnostics can be implemented using online\n(streaming) algorithms ideal for parallel computing environments with limited\nmemory. Constant memory algorithms allow parallel CV to scale up to very large\nblocking designs."}, "http://arxiv.org/abs/2310.07016": {"title": "Discovering the Unknowns: A First Step", "link": "http://arxiv.org/abs/2310.07016", "description": "This article aims at discovering the unknown variables in the system through\ndata analysis. The main idea is to use the time of data collection as a\nsurrogate variable and try to identify the unknown variables by modeling\ngradual and sudden changes in the data. We use Gaussian process modeling and a\nsparse representation of the sudden changes to efficiently estimate the large\nnumber of parameters in the proposed statistical model. The method is tested on\na realistic dataset generated using a one-dimensional implementation of a\nMagnetized Liner Inertial Fusion (MagLIF) simulation model and encouraging\nresults are obtained."}, "http://arxiv.org/abs/2310.07107": {"title": "Root n consistent extremile regression and its supervised and semi-supervised learning", "link": "http://arxiv.org/abs/2310.07107", "description": "Extremile (Daouia, Gijbels and Stupfler,2019) is a novel and coherent measure\nof risk, determined by weighted expectations rather than tail probabilities. It\nfinds application in risk management, and, in contrast to quantiles, it\nfulfills the axioms of consistency, taking into account the severity of tail\nlosses. However, existing studies (Daouia, Gijbels and Stupfler,2019,2022) on\nextremile involve unknown distribution functions, making it challenging to\nobtain a root n-consistent estimator for unknown parameters in linear extremile\nregression. This article introduces a new definition of linear extremile\nregression and its estimation method, where the estimator is root n-consistent.\nAdditionally, while the analysis of unlabeled data for extremes presents a\nsignificant challenge and is currently a topic of great interest in machine\nlearning for various classification problems, we have developed a\nsemi-supervised framework for the proposed extremile regression using unlabeled\ndata. This framework can also enhance estimation accuracy under model\nmisspecification. Both simulations and real data analyses have been conducted\nto illustrate the finite sample performance of the proposed methods."}, "http://arxiv.org/abs/2310.07124": {"title": "Systematic simulation of age-period-cohort analysis: Demonstrating bias of Bayesian regularization", "link": "http://arxiv.org/abs/2310.07124", "description": "Age-period-cohort (APC) analysis is one of the fundamental time-series\nanalyses used in the social sciences. 
This paper evaluates APC analysis via\nsystematic simulation in terms of how well the artificial parameters are\nrecovered. We consider three models of Bayesian regularization using normal\nprior distributions: the random effects model with reference to multilevel\nanalysis, the ridge regression model equivalent to the intrinsic estimator, and\nthe random walk model referred to as the Bayesian cohort model. The proposed\nsimulation generates artificial data through combinations of the linear\ncomponents, focusing on the fact that the identification problem affects the\nlinear components of the three effects. Among the 13 cases of artificial data,\nthe random walk model recovered the artificial parameters well in 10 cases,\nwhile the random effects model and the ridge regression model did so in 4\ncases. In the cases where the models failed to recover the artificial\nparameters, the estimated linear component of the cohort effects is close\nto zero. In conclusion, the models of Bayesian regularization in APC analysis\nhave a bias: the index weights have a large influence on the cohort effects and\nthese constraints drive the linear component of the cohort effects close to\nzero. However, the random walk model mitigates the underestimation of the linear\ncomponent of the cohort effects."}, "http://arxiv.org/abs/2310.07330": {"title": "Functional Generalized Canonical Correlation Analysis for studying multiple longitudinal variables", "link": "http://arxiv.org/abs/2310.07330", "description": "In this paper, we introduce Functional Generalized Canonical Correlation\nAnalysis (FGCCA), a new framework for exploring associations between multiple\nrandom processes observed jointly. The framework is based on the multiblock\nRegularized Generalized Canonical Correlation Analysis (RGCCA) framework. It is\nrobust to sparsely and irregularly observed data, making it applicable in many\nsettings. We establish the monotonic property of the solving procedure and\nintroduce a Bayesian approach for estimating canonical components. We propose\nan extension of the framework that allows the integration of a univariate or\nmultivariate response into the analysis, paving the way for predictive\napplications. We evaluate the method's efficiency in simulation studies and\npresent a use case on a longitudinal dataset."}, "http://arxiv.org/abs/2310.07364": {"title": "Statistical inference of high-dimensional vector autoregressive time series with non-i.i.d. innovations", "link": "http://arxiv.org/abs/2310.07364", "description": "The assumption of independent or i.i.d. innovations is essential in the\nliterature for analyzing a vector time series. However, this assumption is\neither too restrictive for a real-life time series to satisfy or hard to\nverify through a hypothesis test. This paper performs statistical inference on\na sparse high-dimensional vector autoregressive time series, allowing its white\nnoise innovations to be dependent, even non-stationary. To achieve this goal,\nit adopts a post-selection estimator to fit the vector autoregressive model and\nderives the asymptotic distribution of the post-selection estimator. The\ninnovations in the autoregressive time series are not assumed to be\nindependent, thus making the covariance matrices of the autoregressive\ncoefficient estimators complex and difficult to estimate. Our work develops a\nbootstrap algorithm to facilitate practitioners in performing statistical\ninference without having to engage in sophisticated calculations. 
Simulations\nand real-life data experiments reveal the validity of the proposed methods and\ntheoretical results.\n\nReal-life data is rarely considered to exactly satisfy an autoregressive\nmodel with independent or i.i.d. innovations, so our work should better reflect\nthe reality compared to the literature that assumes i.i.d. innovations."}, "http://arxiv.org/abs/2310.07399": {"title": "Randomized Runge-Kutta-Nystr\\\"om", "link": "http://arxiv.org/abs/2310.07399", "description": "We present 5/2- and 7/2-order $L^2$-accurate randomized Runge-Kutta-Nystr\\\"om\nmethods to approximate the Hamiltonian flow underlying various non-reversible\nMarkov chain Monte Carlo chains including unadjusted Hamiltonian Monte Carlo\nand unadjusted kinetic Langevin chains. Quantitative 5/2-order $L^2$-accuracy\nupper bounds are provided under gradient and Hessian Lipschitz assumptions on\nthe potential energy function. The superior complexity of the corresponding\nMarkov chains is numerically demonstrated for a selection of `well-behaved',\nhigh-dimensional target distributions."}, "http://arxiv.org/abs/2310.07456": {"title": "Hierarchical Bayesian Claim Count modeling with Overdispersed Outcome and Mismeasured Covariates in Actuarial Practice", "link": "http://arxiv.org/abs/2310.07456", "description": "The problem of overdispersed claim counts and mismeasured covariates is\ncommon in insurance. On the one hand, the presence of overdispersion in the\ncount data violates the homogeneity assumption, and on the other hand,\nmeasurement errors in covariates highlight the model risk issue in actuarial\npractice. The consequence can be inaccurate premium pricing which would\nnegatively affect business competitiveness. Our goal is to address these two\nmodelling problems simultaneously by capturing the unobservable correlations\nbetween observations that arise from overdispersed outcome and mismeasured\ncovariate in actuarial process. To this end, we establish novel connections\nbetween the count-based generalized linear mixed model (GLMM) and a popular\nerror-correction tool for non-linear modelling - Simulation Extrapolation\n(SIMEX). We consider a modelling framework based on the hierarchical Bayesian\nparadigm. To our knowledge, the approach of combining a hierarchical Bayes with\nSIMEX has not previously been discussed in the literature. We demonstrate the\napplicability of our approach on the workplace absenteeism data. Our results\nindicate that the hierarchical Bayesian GLMM incorporated with the SIMEX\noutperforms naive GLMM / SIMEX in terms of goodness of fit."}, "http://arxiv.org/abs/2310.07567": {"title": "Comparing the effectiveness of k-different treatments through the area under the ROC curve", "link": "http://arxiv.org/abs/2310.07567", "description": "The area under the receiver-operating characteristic curve (AUC) has become a\npopular index not only for measuring the overall prediction capacity of a\nmarker but also the association strength between continuous and binary\nvariables. In the current study, it has been used for comparing the association\nsize of four different interventions involving impulsive decision making,\nstudied through an animal model, in which each animal provides several negative\n(pre-treatment) and positive (post-treatment) measures. The problem of the full\ncomparison of the average AUCs arises therefore in a natural way. 
We construct\nan analysis of variance (ANOVA) type test for testing the equality of the\nimpact of these treatments measured through the respective AUCs, and\nconsidering the random-effect represented by the animal. The use (and\ndevelopment) of a post-hoc Tukey's HSD type test is also considered. We explore\nthe finite-sample behavior of our proposal via Monte Carlo simulations, and\nanalyze the data generated from the original problem. An R package implementing\nthe procedures is provided as supplementary material."}, "http://arxiv.org/abs/2310.07605": {"title": "Split Knockoffs for Multiple Comparisons: Controlling the Directional False Discovery Rate", "link": "http://arxiv.org/abs/2310.07605", "description": "Multiple comparisons in hypothesis testing often encounter structural\nconstraints in various applications. For instance, in structural Magnetic\nResonance Imaging for Alzheimer's Disease, the focus extends beyond examining\natrophic brain regions to include comparisons of anatomically adjacent regions.\nThese constraints can be modeled as linear transformations of parameters, where\nthe sign patterns play a crucial role in estimating directional effects. This\nclass of problems, encompassing total variations, wavelet transforms, fused\nLASSO, trend filtering, and more, presents an open challenge in effectively\ncontrolling the directional false discovery rate. In this paper, we propose an\nextended Split Knockoff method specifically designed to address the control of\ndirectional false discovery rate under linear transformations. Our proposed\napproach relaxes the stringent linear manifold constraint to its neighborhood,\nemploying a variable splitting technique commonly used in optimization. This\nmethodology yields an orthogonal design that benefits both power and\ndirectional false discovery rate control. By incorporating a sample splitting\nscheme, we achieve effective control of the directional false discovery rate,\nwith a notable reduction to zero as the relaxed neighborhood expands. To\ndemonstrate the efficacy of our method, we conduct simulation experiments and\napply it to two real-world scenarios: Alzheimer's Disease analysis and human\nage comparisons."}, "http://arxiv.org/abs/2310.07680": {"title": "Hamiltonian Dynamics of Bayesian Inference Formalised by Arc Hamiltonian Systems", "link": "http://arxiv.org/abs/2310.07680", "description": "This paper makes two theoretical contributions. First, we establish a novel\nclass of Hamiltonian systems, called arc Hamiltonian systems, for saddle\nHamiltonian functions over infinite-dimensional metric spaces. Arc Hamiltonian\nsystems generate a flow that satisfies the law of conservation of energy\neverywhere in a metric space. They are governed by an extension of Hamilton's\nequation formulated based on (i) the framework of arc fields and (ii) an\ninfinite-dimensional gradient, termed the arc gradient, of a Hamiltonian\nfunction. We derive conditions for the existence of a flow generated by an arc\nHamiltonian system, showing that they reduce to local Lipschitz continuity of\nthe arc gradient under sufficient regularity. Second, we present two\nHamiltonian functions, called the cumulant generating functional and the\ncentred cumulant generating functional, over a metric space of log-likelihoods\nand measures. The former characterises the posterior of Bayesian inference as a\npart of the arc gradient that induces a flow of log-likelihoods and\nnon-negative measures. 
The latter characterises the difference of the posterior\nand the prior as a part of the arc gradient that induces a flow of\nlog-likelihoods and probability measures. Our results reveal an implication of\nthe belief updating mechanism from the prior to the posterior as an\ninfinitesimal change of a measure in the infinite-dimensional Hamiltonian\nflows."}, "http://arxiv.org/abs/2009.12217": {"title": "Latent Causal Socioeconomic Health Index", "link": "http://arxiv.org/abs/2009.12217", "description": "This research develops a model-based LAtent Causal Socioeconomic Health\n(LACSH) index at the national level. Motivated by the need for a holistic\nnational well-being index, we build upon the latent health factor index (LHFI)\napproach that has been used to assess the unobservable ecological/ecosystem\nhealth. LHFI integratively models the relationship between metrics, latent\nhealth, and covariates that drive the notion of health. In this paper, the LHFI\nstructure is integrated with spatial modeling and statistical causal modeling.\nOur efforts are focused on developing the integrated framework to facilitate\nthe understanding of how an observational continuous variable might have\ncausally affected a latent trait that exhibits spatial correlation. A novel\nvisualization technique to evaluate covariate balance is also introduced for\nthe case of a continuous policy (treatment) variable. Our resulting LACSH\nframework and visualization tool are illustrated through two global case\nstudies on national socioeconomic health (latent trait), each with various\nmetrics and covariates pertaining to different aspects of societal health, and\nthe treatment variable being mandatory maternity leave days and government\nexpenditure on healthcare, respectively. We validate our model by two\nsimulation studies. All approaches are structured in a Bayesian hierarchical\nframework and results are obtained by Markov chain Monte Carlo techniques."}, "http://arxiv.org/abs/2201.02958": {"title": "Smooth Nested Simulation: Bridging Cubic and Square Root Convergence Rates in High Dimensions", "link": "http://arxiv.org/abs/2201.02958", "description": "Nested simulation concerns estimating functionals of a conditional\nexpectation via simulation. In this paper, we propose a new method based on\nkernel ridge regression to exploit the smoothness of the conditional\nexpectation as a function of the multidimensional conditioning variable.\nAsymptotic analysis shows that the proposed method can effectively alleviate\nthe curse of dimensionality on the convergence rate as the simulation budget\nincreases, provided that the conditional expectation is sufficiently smooth.\nThe smoothness bridges the gap between the cubic root convergence rate (that\nis, the optimal rate for the standard nested simulation) and the square root\nconvergence rate (that is, the canonical rate for the standard Monte Carlo\nsimulation). We demonstrate the performance of the proposed method via\nnumerical examples from portfolio risk management and input uncertainty\nquantification."}, "http://arxiv.org/abs/2204.12635": {"title": "Multivariate and regression models for directional data based on projected P\\'olya trees", "link": "http://arxiv.org/abs/2204.12635", "description": "Projected distributions have proved to be useful in the study of circular and\ndirectional data. Although any multivariate distribution can be used to produce\na projected model, these distributions are typically parametric. 
In this\narticle we consider a multivariate P\\'olya tree on $R^k$ and project it to the\nunit hypersphere $S^k$ to define a new Bayesian nonparametric model for\ndirectional data. We study the properties of the proposed model and in\nparticular, concentrate on the implied conditional distributions of some\ndirections given the others to define a directional-directional regression\nmodel. We also define a multivariate linear regression model with P\\'olya tree\nerror and project it to define a linear-directional regression model. We obtain\nthe posterior characterisation of all models and show their performance with\nsimulated and real datasets."}, "http://arxiv.org/abs/2207.13250": {"title": "Spatio-Temporal Wildfire Prediction using Multi-Modal Data", "link": "http://arxiv.org/abs/2207.13250", "description": "Due to severe societal and environmental impacts, wildfire prediction using\nmulti-modal sensing data has become a highly sought-after data-analytical tool\nby various stakeholders (such as state governments and power utility companies)\nto achieve a more informed understanding of wildfire activities and plan\npreventive measures. A desirable algorithm should precisely predict fire risk\nand magnitude for a location in real time. In this paper, we develop a flexible\nspatio-temporal wildfire prediction framework using multi-modal time series\ndata. We first predict the wildfire risk (the chance of a wildfire event) in\nreal-time, considering the historical events using discrete mutually exciting\npoint process models. Then we further develop a wildfire magnitude prediction\nset method based on the flexible distribution-free time-series conformal\nprediction (CP) approach. Theoretically, we prove a risk model parameter\nrecovery guarantee, as well as coverage and set size guarantees for the CP\nsets. Through extensive real-data experiments with wildfire data in California,\nwe demonstrate the effectiveness of our methods, as well as their flexibility\nand scalability in large regions."}, "http://arxiv.org/abs/2210.13550": {"title": "Regularized Nonlinear Regression with Dependent Errors and its Application to a Biomechanical Model", "link": "http://arxiv.org/abs/2210.13550", "description": "A biomechanical model often requires parameter estimation and selection in a\nknown but complicated nonlinear function. Motivated by observing that data from\na head-neck position tracking system, one of biomechanical models, show\nmultiplicative time dependent errors, we develop a modified penalized weighted\nleast squares estimator. The proposed method can be also applied to a model\nwith non-zero mean time dependent additive errors. Asymptotic properties of the\nproposed estimator are investigated under mild conditions on a weight matrix\nand the error process. A simulation study demonstrates that the proposed\nestimation works well in both parameter estimation and selection with time\ndependent error. The analysis and comparison with an existing method for\nhead-neck position tracking data show better performance of the proposed method\nin terms of the variance accounted for (VAF)."}, "http://arxiv.org/abs/2210.14965": {"title": "Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions", "link": "http://arxiv.org/abs/2210.14965", "description": "This paper adopts a tool from computational topology, the Euler\ncharacteristic curve (ECC) of a sample, to perform one- and two-sample goodness\nof fit tests. We call our procedure TopoTests. 
The presented tests work for\nsamples of arbitrary dimension, having comparable power to the state-of-the-art\ntests in the one-dimensional case. It is demonstrated that the type I error of\nTopoTests can be controlled and their type II error vanishes exponentially with\nincreasing sample size. Extensive numerical simulations of TopoTests are\nconducted to demonstrate their power for samples of various sizes."}, "http://arxiv.org/abs/2211.03860": {"title": "Automatic Change-Point Detection in Time Series via Deep Learning", "link": "http://arxiv.org/abs/2211.03860", "description": "Detecting change-points in data is challenging because of the range of\npossible types of change and types of behaviour of data when there is no\nchange. Statistically efficient methods for detecting a change will depend on\nboth of these features, and it can be difficult for a practitioner to develop\nan appropriate detection method for their application of interest. We show how\nto automatically generate new offline detection methods based on training a\nneural network. Our approach is motivated by many existing tests for the\npresence of a change-point being representable by a simple neural network, and\nthus a neural network trained with sufficient data should have performance at\nleast as good as these methods. We present theory that quantifies the error\nrate for such an approach, and how it depends on the amount of training data.\nEmpirical results show that, even with limited training data, its performance\nis competitive with the standard CUSUM-based classifier for detecting a change\nin mean when the noise is independent and Gaussian, and can substantially\noutperform it in the presence of auto-correlated or heavy-tailed noise. Our\nmethod also shows strong results in detecting and localising changes in\nactivity based on accelerometer data."}, "http://arxiv.org/abs/2211.09099": {"title": "Selecting Subpopulations for Causal Inference in Regression Discontinuity Designs", "link": "http://arxiv.org/abs/2211.09099", "description": "The Brazil Bolsa Familia (BF) program is a conditional cash transfer program\naimed to reduce short-term poverty by direct cash transfers and to fight\nlong-term poverty by increasing human capital among poor Brazilian people.\nEligibility for Bolsa Familia benefits depends on a cutoff rule, which\nclassifies the BF study as a regression discontinuity (RD) design. Extracting\ncausal information from RD studies is challenging. Following Li et al (2015)\nand Branson and Mealli (2019), we formally describe the BF RD design as a local\nrandomized experiment within the potential outcome approach. Under this\nframework, causal effects can be identified and estimated on a subpopulation\nwhere a local overlap assumption, a local SUTVA and a local ignorability\nassumption hold. We first discuss the potential advantages of this framework\nover local regression methods based on continuity assumptions, which concern\nthe definition of the causal estimands, the design and the analysis of the\nstudy, and the interpretation and generalizability of the results. A critical\nissue of this local randomization approach is how to choose subpopulations for\nwhich we can draw valid causal inference. We propose a Bayesian model-based\nfinite mixture approach to clustering to classify observations into\nsubpopulations where the RD assumptions hold and do not hold. 
This approach has\nimportant advantages: a) it allows us to account for the uncertainty in the\nsubpopulation membership, which is typically neglected; b) it does not impose\nany constraint on the shape of the subpopulation; c) it is scalable to\nhigh-dimensional settings; d) it allows us to target causal estimands other\nthan the average treatment effect (ATE); and e) it is robust to a certain\ndegree of manipulation/selection of the running variable. We apply our proposed\napproach to assess the causal effects of the Bolsa Familia program on leprosy\nincidence in 2009."}, "http://arxiv.org/abs/2301.08276": {"title": "Cross-validatory model selection for Bayesian autoregressions with exogenous regressors", "link": "http://arxiv.org/abs/2301.08276", "description": "Bayesian cross-validation (CV) is a popular method for predictive model\nassessment that is simple to implement and broadly applicable. A wide range of\nCV schemes is available for time series applications, including generic\nleave-one-out (LOO) and K-fold methods, as well as specialized approaches\nintended to deal with serial dependence such as leave-future-out (LFO),\nh-block, and hv-block.\n\nExisting large-sample results show that both specialized and generic methods\nare applicable to models of serially-dependent data. However, large sample\nconsistency results overlook the impact of sampling variability on accuracy in\nfinite samples. Moreover, the accuracy of a CV scheme depends on many aspects\nof the procedure. We show that poor design choices can lead to elevated rates\nof adverse selection.\n\nIn this paper, we consider the problem of identifying the regression\ncomponent of an important class of models of data with serial dependence,\nautoregressions of order p with q exogenous regressors (ARX(p,q)), under the\nlogarithmic scoring rule. We show that when serial dependence is present,\nscores computed using the joint (multivariate) density have lower variance and\nbetter model selection accuracy than the popular pointwise estimator. In\naddition, we present a detailed case study of the special case of ARX models\nwith fixed autoregressive structure and variance. For this class, we derive the\nfinite-sample distribution of the CV estimators and the model selection\nstatistic. We conclude with recommendations for practitioners."}, "http://arxiv.org/abs/2301.12026": {"title": "G-formula for causal inference via multiple imputation", "link": "http://arxiv.org/abs/2301.12026", "description": "G-formula is a popular approach for estimating treatment or exposure effects\nfrom longitudinal data that are subject to time-varying confounding. G-formula\nestimation is typically performed by Monte-Carlo simulation, with\nnon-parametric bootstrapping used for inference. We show that G-formula can be\nimplemented by exploiting existing methods for multiple imputation (MI) for\nsynthetic data. This involves using an existing modified version of Rubin's\nvariance estimator. In practice, missing data are ubiquitous in longitudinal\ndatasets. We show that such missing data can be readily accommodated as part of\nthe MI procedure when using G-formula, and describe how MI software can be used\nto implement the approach. 
We explore its performance using a simulation study\nand an application from cystic fibrosis."}, "http://arxiv.org/abs/2306.01292": {"title": "Alternative Measures of Direct and Indirect Effects", "link": "http://arxiv.org/abs/2306.01292", "description": "There are a number of measures of direct and indirect effects in the\nliterature. They are suitable in some cases and unsuitable in others. We\ndescribe a case where the existing measures are unsuitable and propose new\nsuitable ones. We also show that the new measures can partially handle\nunmeasured treatment-outcome confounding, and bound long-term effects by\ncombining experimental and observational data."}, "http://arxiv.org/abs/2308.00913": {"title": "The Bayesian Context Trees State Space Model for time series modelling and forecasting", "link": "http://arxiv.org/abs/2308.00913", "description": "A hierarchical Bayesian framework is introduced for developing rich mixture\nmodels for real-valued time series, partly motivated by important applications\nin financial time series analysis. At the top level, meaningful discrete states\nare identified as appropriately quantised values of some of the most recent\nsamples. These observable states are described as a discrete context-tree\nmodel. At the bottom level, a different, arbitrary model for real-valued time\nseries -- a base model -- is associated with each state. This defines a very\ngeneral framework that can be used in conjunction with any existing model class\nto build flexible and interpretable mixture models. We call this the Bayesian\nContext Trees State Space Model, or the BCT-X framework. Efficient algorithms\nare introduced that allow for effective, exact Bayesian inference and learning\nin this setting; in particular, the maximum a posteriori probability (MAP)\ncontext-tree model can be identified. These algorithms can be updated\nsequentially, facilitating efficient online forecasting. The utility of the\ngeneral framework is illustrated in two particular instances: When\nautoregressive (AR) models are used as base models, resulting in a nonlinear AR\nmixture model, and when conditional heteroscedastic (ARCH) models are used,\nresulting in a mixture model that offers a powerful and systematic way of\nmodelling the well-known volatility asymmetries in financial data. In\nforecasting, the BCT-X methods are found to outperform state-of-the-art\ntechniques on simulated and real-world data, both in terms of accuracy and\ncomputational requirements. In modelling, the BCT-X structure finds natural\nstructure present in the data. In particular, the BCT-ARCH model reveals a\nnovel, important feature of stock market index data, in the form of an enhanced\nleverage effect."}, "http://arxiv.org/abs/2309.11942": {"title": "On the Probability of Immunity", "link": "http://arxiv.org/abs/2309.11942", "description": "This work is devoted to the study of the probability of immunity, i.e. the\neffect occurs whether exposed or not. We derive necessary and sufficient\nconditions for non-immunity and $\\epsilon$-bounded immunity, i.e. the\nprobability of immunity is zero and $\\epsilon$-bounded, respectively. The\nformer allows us to estimate the probability of benefit (i.e., the effect\noccurs if and only if exposed) from a randomized controlled trial, and the\nlatter allows us to produce bounds of the probability of benefit that are\ntighter than the existing ones. 
We also introduce the concept of indirect\nimmunity (i.e., through a mediator) and repeat our previous analysis for it.\nFinally, we propose a method for sensitivity analysis of the probability of\nimmunity under unmeasured confounding."}, "http://arxiv.org/abs/2309.13441": {"title": "Anytime valid and asymptotically optimal statistical inference driven by predictive recursion", "link": "http://arxiv.org/abs/2309.13441", "description": "Distinguishing two classes of candidate models is a fundamental and\npractically important problem in statistical inference. Error rate control is\ncrucial to the logic but, in complex nonparametric settings, such guarantees\ncan be difficult to achieve, especially when the stopping rule that determines\nthe data collection process is not available. In this paper we develop a novel\ne-process construction that leverages the so-called predictive recursion (PR)\nalgorithm designed to rapidly and recursively fit nonparametric mixture models.\nThe resulting PRe-process affords anytime valid inference uniformly over\nstopping rules and is shown to be efficient in the sense that it achieves the\nmaximal growth rate under the alternative relative to the mixture model being\nfit by PR. In the special case of testing for a log-concave density, the\nPRe-process test is computationally simpler and faster, more stable, and no\nless efficient compared to a recently proposed anytime valid test."}, "http://arxiv.org/abs/2309.16598": {"title": "Cross-Prediction-Powered Inference", "link": "http://arxiv.org/abs/2309.16598", "description": "While reliable data-driven decision-making hinges on high-quality labeled\ndata, the acquisition of quality labels often involves laborious human\nannotations or slow and expensive scientific measurements. Machine learning is\nbecoming an appealing alternative as sophisticated predictive techniques are\nbeing used to quickly and cheaply produce large amounts of predicted labels;\ne.g., predicted protein structures are used to supplement experimentally\nderived structures, predictions of socioeconomic indicators from satellite\nimagery are used to supplement accurate survey data, and so on. Since\npredictions are imperfect and potentially biased, this practice brings into\nquestion the validity of downstream inferences. We introduce cross-prediction:\na method for valid inference powered by machine learning. With a small labeled\ndataset and a large unlabeled dataset, cross-prediction imputes the missing\nlabels via machine learning and applies a form of debiasing to remedy the\nprediction inaccuracies. The resulting inferences achieve the desired error\nprobability and are more powerful than those that only leverage the labeled\ndata. Closely related is the recent proposal of prediction-powered inference,\nwhich assumes that a good pre-trained model is already available. We show that\ncross-prediction is consistently more powerful than an adaptation of\nprediction-powered inference in which a fraction of the labeled data is split\noff and used to train the model. Finally, we observe that cross-prediction\ngives more stable conclusions than its competitors; its confidence intervals\ntypically have significantly lower variability."}} \ No newline at end of file