From fc7f6d2224aed55537bf91b67271d9d1dc17eb60 Mon Sep 17 00:00:00 2001
From: Apoorva Lal
Date: Tue, 10 Oct 2023 12:49:07 -0700
Subject: [PATCH] make sure archiving works instantaneously.

---
 econ_em_draws.json |  2 +-
 paperbot.py        | 77 ++++++++++++++++++++++++++--------------------
 stat_me_draws.json |  2 +-
 3 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/econ_em_draws.json b/econ_em_draws.json
index 9742438..fec994c 100644
--- a/econ_em_draws.json
+++ b/econ_em_draws.json
@@ -1 +1 @@
-{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. 
Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. In simulation studies, our joint testing procedures outperform the\nBonferroni correction. We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. 
To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}} +{"http://arxiv.org/abs/2310.03435": {"title": "Variational Inference for GARCH-family Models", "link": "http://arxiv.org/abs/2310.03435", "description": "The Bayesian estimation of GARCH-family models has been typically addressed\nthrough Monte Carlo sampling. Variational Inference is gaining popularity and\nattention as a robust approach for Bayesian inference in complex machine\nlearning models; however, its adoption in econometrics and finance is limited.\nThis paper discusses the extent to which Variational Inference constitutes a\nreliable and feasible alternative to Monte Carlo sampling for Bayesian\ninference in GARCH-like models. Through a large-scale experiment involving the\nconstituents of the S&P 500 index, several Variational Inference optimizers, a\nvariety of volatility models, and a case study, we show that Variational\nInference is an attractive, remarkably well-calibrated, and competitive method\nfor Bayesian learning."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. 
In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2205.04345": {"title": "Joint diagnostic test of regression discontinuity designs: multiple testing problem", "link": "http://arxiv.org/abs/2205.04345", "description": "Current diagnostic tests for regression discontinuity (RD) design face a\nmultiple testing problem. We find a massive over-rejection of the identifying\nrestriction among empirical RD studies published in top-five economics\njournals. Each test achieves a nominal size of 5%; however, the median number\nof tests per study is 12. Consequently, more than one-third of studies reject\nat least one of these tests and their diagnostic procedures are invalid for\njustifying the identifying assumption. We offer a joint testing procedure to\nresolve the multiple testing problem. Our procedure is based on a new joint\nasymptotic normality of local linear estimates and local polynomial density\nestimates. In simulation studies, our joint testing procedures outperform the\nBonferroni correction. We implement the procedure as an R package, rdtest, with\ntwo empirical examples in its vignettes."}, "http://arxiv.org/abs/2212.04620": {"title": "On the Non-Identification of Revenue Production Functions", "link": "http://arxiv.org/abs/2212.04620", "description": "Production functions are potentially misspecified when revenue is used as a\nproxy for output. I formalize and strengthen this common knowledge by showing\nthat neither the production function nor Hicks-neutral productivity can be\nidentified with such a revenue proxy. This result holds under the standard\nassumptions used in the literature for a large class of production functions,\nincluding all commonly used parametric forms. Among the prevalent approaches to\naddress this issue, only those that impose assumptions on the underlying demand\nsystem can possibly identify the production function."}, "http://arxiv.org/abs/2307.13364": {"title": "Tuning-free testing of factor regression against factor-augmented sparse alternatives", "link": "http://arxiv.org/abs/2307.13364", "description": "This study introduces a bootstrap test of the validity of factor regression\nwithin a high-dimensional factor-augmented sparse regression model that\nintegrates factor and sparse regression techniques. The test provides a means\nto assess the suitability of the classical dense factor regression model\ncompared to a sparse plus dense alternative augmenting factor regression with\nidiosyncratic shocks. Our proposed test does not require tuning parameters,\neliminates the need to estimate covariance matrices, and offers simplicity in\nimplementation. The validity of the test is theoretically established under\ntime-series dependence. Through simulation experiments, we demonstrate the\nfavorable finite sample performance of our procedure. Moreover, using the\nFRED-MD dataset, we apply the test and reject the adequacy of the classical\nfactor regression model when the dependent variable is inflation but not when\nit is industrial production. 
These findings offer insights into selecting\nappropriate models for high-dimensional datasets."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2310.04576": {"title": "Finite Sample Performance of a Conduct Parameter Test in Homogenous Goods Markets", "link": "http://arxiv.org/abs/2310.04576", "description": "We assess the finite sample performance of the conduct parameter test in\nhomogeneous goods markets. Statistical power rises with an increase in the\nnumber of markets, a larger conduct parameter, and a stronger demand rotation\ninstrument. However, even with a moderate number of markets and five firms,\nregardless of instrument strength and the utilization of optimal instruments,\nrejecting the null hypothesis of perfect competition remains challenging. Our\nfindings indicate that empirical results that fail to reject perfect\ncompetition are a consequence of the limited number of markets rather than\nmethodological deficiencies."}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. 
The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.05311": {"title": "Identification and Estimation in a Class of Potential Outcomes Models", "link": "http://arxiv.org/abs/2310.05311", "description": "This paper develops a class of potential outcomes models characterized by\nthree main features: (i) Unobserved heterogeneity can be represented by a\nvector of potential outcomes and a type describing the manner in which an\ninstrument determines the choice of treatment; (ii) The availability of an\ninstrumental variable that is conditionally independent of unobserved\nheterogeneity; and (iii) The imposition of convex restrictions on the\ndistribution of unobserved heterogeneity. The proposed class of models\nencompasses multiple classical and novel research designs, yet possesses a\ncommon structure that permits a unifying analysis of identification and\nestimation. In particular, we establish that these models share a common\nnecessary and sufficient condition for identifying certain causal parameters.\nOur identification results are constructive in that they yield estimating\nmoment conditions for the parameters of interest. Focusing on a leading special\ncase of our framework, we further show how these estimating moment conditions\nmay be modified to be doubly robust. The corresponding double robust estimators\nare shown to be asymptotically normally distributed, bootstrap based inference\nis shown to be asymptotically valid, and the semi-parametric efficiency bound\nis derived for those parameters that are root-n estimable. We illustrate the\nusefulness of our results for developing, identifying, and estimating causal\nmodels through an empirical evaluation of the role of mental health as a\nmediating variable in the Moving To Opportunity experiment."}, "http://arxiv.org/abs/2310.05761": {"title": "Robust Minimum Distance Inference in Structural Models", "link": "http://arxiv.org/abs/2310.05761", "description": "This paper proposes minimum distance inference for a structural parameter of\ninterest, which is robust to the lack of identification of other structural\nnuisance parameters. Some choices of the weighting matrix lead to asymptotic\nchi-squared distributions with degrees of freedom that can be consistently\nestimated from the data, even under partial identification. In any case,\nknowledge of the level of under-identification is not required. We study the\npower of our robust test. Several examples show the wide applicability of the\nprocedure and a Monte Carlo investigates its finite sample performance. Our\nidentification-robust inference method can be applied to make inferences on\nboth calibrated (fixed) parameters and any other structural parameter of\ninterest. 
We illustrate the method's usefulness by applying it to a structural\nmodel on the non-neutrality of monetary policy, as in \\cite{nakamura2018high},\nwhere we empirically evaluate the validity of the calibrated parameters and we\ncarry out robust inference on the slope of the Phillips curve and the\ninformation effect."}, "http://arxiv.org/abs/2302.13066": {"title": "Estimating Fiscal Multipliers by Combining Statistical Identification with Potentially Endogenous Proxies", "link": "http://arxiv.org/abs/2302.13066", "description": "Different proxy variables used in fiscal policy SVARs lead to contradicting\nconclusions regarding the size of fiscal multipliers. In this paper, we show\nthat the conflicting results are due to violations of the exogeneity\nassumptions, i.e. the commonly used proxies are endogenously related to the\nstructural shocks. We propose a novel approach to include proxy variables into\na Bayesian non-Gaussian SVAR, tailored to accommodate potentially endogenous\nproxy variables. Using our model, we show that increasing government spending\nis a more effective tool to stimulate the economy than reducing taxes. We\nconstruct new exogenous proxies that can be used in the traditional proxy VAR\napproach resulting in similar estimates compared to our proposed hybrid SVAR\nmodel."}, "http://arxiv.org/abs/2303.01863": {"title": "Constructing High Frequency Economic Indicators by Imputation", "link": "http://arxiv.org/abs/2303.01863", "description": "Monthly and weekly economic indicators are often taken to be the largest\ncommon factor estimated from high and low frequency data, either separately or\njointly. To incorporate mixed frequency information without directly modeling\nthem, we target a low frequency diffusion index that is already available, and\ntreat high frequency values as missing. We impute these values using multiple\nfactors estimated from the high frequency data. In the empirical examples\nconsidered, static matrix completion that does not account for serial\ncorrelation in the idiosyncratic errors yields imprecise estimates of the\nmissing values irrespective of how the factors are estimated. Single equation\nand systems-based dynamic procedures that account for serial correlation yield\nimputed values that are closer to the observed low frequency ones. This is the\ncase in the counterfactual exercise that imputes the monthly values of consumer\nsentiment series before 1978 when the data was released only on a quarterly\nbasis. This is also the case for a weekly version of the CFNAI index of\neconomic activity that is imputed using seasonally unadjusted data. 
The imputed\nseries reveals episodes of increased variability of weekly economic information\nthat are masked by the monthly data, notably around the 2014-15 collapse in oil\nprices."}}
\ No newline at end of file
diff --git a/paperbot.py b/paperbot.py
index 0de5365..3813e12 100644
--- a/paperbot.py
+++ b/paperbot.py
@@ -155,64 +155,75 @@ def get_arxiv_feed(subject: str):
     return res
 
 
+def get_and_write_feed_json(feedname: str, filename: str):
+    feed = get_arxiv_feed(feedname)
+    with open(filename, "r") as f:
+        archive = json.load(f)
+    new_archive = archive.copy()
+    # append new items
+    for k, v in feed.items():
+        if k not in archive:
+            new_archive[k] = v
+    # write out only if new items exist
+    if len(new_archive) > len(archive):
+        with open(filename, "w") as f:
+            json.dump(new_archive, f, indent=None)
+        print(f"{filename} updated")
+    return feed, archive
+
+
 # %%
 def main():
+    # query and write immediately
+    stats_pull, stat_me_archive = get_and_write_feed_json("stat.ME", "stat_me_draws.json")
+    em_pull, econ_em_archive = get_and_write_feed_json("econ.EM", "econ_em_draws.json")
+
     ######################################################################
     # stats
     ######################################################################
     # read existing data from "stat_me_draws.json" file
-    try:
-        with open("stat_me_draws.json", "r") as f:
-            stat_me_archive = json.load(f)
-    except FileNotFoundError:
-        stat_me_archive = {}
-    # Get new data from arxiv feed
-    new_pull = get_arxiv_feed("stat.ME")
     new_posts = 0
     # Append new data to existing data
-    for k, v in new_pull.items():
+    for k, v in stats_pull.items():
         if k not in stat_me_archive:  # if not already posted
-            create_post(f"{v['title']}\n{v['link']}\n{''.join(v['description'])}"[:297] + "\nšŸ“ˆšŸ¤–")
-            time.sleep(random.randint(120, 600))
+            create_post(
+                f"{v['title']}\n{v['link']}\n{''.join(v['description'])}"[:297] + "\nšŸ“ˆšŸ¤–"
+            )
+            time.sleep(random.randint(60, 300))
             stat_me_archive[k] = v
             new_posts += 1
     if new_posts == 0 & (len(stat_me_archive) > 2):
         print("No new papers found; posting random paper from archive")
         random_paper = random.choice(list(stat_me_archive.values()))
-        create_post(f"{random_paper['title']}\n{random_paper['link']}\n{''.join(random_paper['description'])}"[:297] + "\nšŸ“ˆšŸ¤–")
-        time.sleep(random.randint(120, 600))
-    else:
-        # Write updated data back to "stat_me_draws.json" file - once every run
-        with open("stat_me_draws.json", "w") as f:
-            json.dump(stat_me_archive, f, indent=None)
-        print("wrote stat_me_draws.json")
+        create_post(
+            f"{random_paper['title']}\n{random_paper['link']}\n{''.join(random_paper['description'])}"[
+                :297
+            ]
+            + "\nšŸ“ˆšŸ¤–"
+        )
+        time.sleep(random.randint(30, 60))
     ######################################################################
     # econometrics
     ######################################################################
-    try:
-        with open("econ_em_draws.json", "r") as f:
-            econ_em_archive = json.load(f)
-    except FileNotFoundError:
-        econ_em_archive = {}
-    # Get new data from arxiv feed
-    new_pull = get_arxiv_feed("econ.EM")
     new_posts = 0
     # Append new data to existing data
-    for k, v in new_pull.items():
+    for k, v in em_pull.items():
         if k not in econ_em_archive:
-            create_post(f"{v['title']}\n{v['link']}\n{''.join(v['description'])}"[:297] + "\nšŸ“ˆšŸ¤–")
-            time.sleep(random.randint(120, 600))
+            create_post(
+                f"{v['title']}\n{v['link']}\n{''.join(v['description'])}"[:297] + "\nšŸ“ˆšŸ¤–"
+            )
+            time.sleep(random.randint(60, 300))
             econ_em_archive[k] = v
             new_posts += 1
     if new_posts == 0 & (len(econ_em_archive) > 2):
         print("No new papers found; posting random paper from archive")
         random_paper = random.choice(list(econ_em_archive.values()))
-        create_post(f"{random_paper['title']}\n{random_paper['link']}\n{''.join(random_paper['description'])}"[:297] + "\nšŸ“ˆšŸ¤–")
-    else:
-        # Write updated data back to "econ_em_draws.json" file
-        with open("econ_em_draws.json", "w") as f:
-            json.dump(econ_em_archive, f, indent=None)
-        print("wrote econ_em_draws.json")
+        create_post(
+            f"{random_paper['title']}\n{random_paper['link']}\n{''.join(random_paper['description'])}"[
+                :297
+            ]
+            + "\nšŸ“ˆšŸ¤–"
+        )
+
+
 # %%
 if __name__ == "__main__":
     main()
diff --git a/stat_me_draws.json b/stat_me_draws.json
index d392513..76846f0 100644
--- a/stat_me_draws.json
+++ b/stat_me_draws.json
@@ -1 +1 @@
-{"http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. 
Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). 
Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. 
Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. 
The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. 
We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method-in-test. Still,\nmany causal inference literature tend to design over-restricted or misspecified\nstudies. In this paper, we elaborate on the problem of improper simulation\ndesign for causal methods and compile a list of desiderata for an effective\nsimulation framework. 
We then introduce partially randomized causal simulation\n(PARCS), a simulation framework that meets those desiderata. PARCS synthesizes\ndata based on graphical causal models and a wide range of adjustable\nparameters. There is a legible mapping from usual causal assumptions to the\nparameters, thus, users can identify and specify the subset of related\nparameters and randomize the remaining ones to generate a range of complying\ndata-generating processes for their causal method. The result is a more\ncomprehensive and inclusive empirical investigation for causal claims. Using\nPARCS, we reproduce and extend the simulation studies of two well-known causal\ndiscovery and missing data analysis papers to emphasize the necessity of a\nproper simulation design. Our results show that those papers would have\nimproved and extended the findings, had they used PARCS for simulation. The\nframework is implemented as a Python package, too. By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. 
The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. 
In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameters estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to classical\nresults in fixed dimensional time series that the localization error of\nchange-point estimator is $O_{p}(1)$, exact recovery of true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the number\nand locations of the change-point estimator under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. 
To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed network,\nwhich is frequently encountered in application domains but largely neglected in\nliterature. The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\nfills the void of lacking uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of confidence\ninterval, as well as the false discovery rate (FDR) control for multiple node\ncomparison. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. 
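[Editor's note] As a reference point for the adaptive conformal inference (ACI) abstract above, here is a minimal sketch of the basic ACI recursion of Gibbs and Candes (2021) with a fixed step size gamma; the paper described above replaces the fixed step size with one tuned over time, which is not shown. The absolute-residual score and the rolling calibration window are illustrative assumptions.

import numpy as np

def aci_intervals(y, preds, scores_window=100, alpha=0.1, gamma=0.01):
    """Basic adaptive conformal inference with a fixed step size.

    y: observed series; preds: point forecasts of the same length.
    Returns per-step miscoverage levels alpha_t and interval half-widths.
    """
    alpha_t = alpha
    alphas, radii, scores = [], [], []
    for t in range(len(y)):
        if scores:
            # empirical (1 - alpha_t) quantile of past absolute residuals
            q = np.quantile(scores, min(max(1 - alpha_t, 0.0), 1.0))
        else:
            q = np.inf                             # no calibration data yet
        err = float(abs(y[t] - preds[t]) > q)      # 1 if y_t falls outside the interval
        alpha_t = alpha_t + gamma * (alpha - err)  # ACI gradient-style update
        alphas.append(alpha_t)
        radii.append(q)
        scores.append(abs(y[t] - preds[t]))
        scores = scores[-scores_window:]           # keep a rolling calibration window
    return np.array(alphas), np.array(radii)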
We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. 
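[Editor's note] A minimal sketch of the PCA-based binary clustering the spectral-clustering abstract above refers to: project the data onto the leading eigenvector of the sample covariance and split by the sign of the projected scores. The two-Gaussian toy data are an illustrative assumption, not the allometric extension model itself.

import numpy as np

def pca_binary_cluster(X):
    """Cluster rows of X into two groups using the first principal component."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                      # leading eigenvector
    scores = Xc @ v1
    return (scores > 0).astype(int)          # binary labels by sign of the score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(100, 5)),
               rng.normal(+2, 1, size=(100, 5))])
labels = pca_binary_cluster(X)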
Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}} +{"http://arxiv.org/abs/2310.03114": {"title": "Bayesian Parameter Inference for Partially Observed Stochastic Volterra Equations", "link": "http://arxiv.org/abs/2310.03114", "description": "In this article we consider Bayesian parameter inference for a type of\npartially observed stochastic Volterra equation (SVE). SVEs are found in many\nareas such as physics and mathematical finance. In the latter field they can be\nused to represent long memory in unobserved volatility processes. In many cases\nof practical interest, SVEs must be time-discretized and then parameter\ninference is based upon the posterior associated to this time-discretized\nprocess. Based upon recent studies on time-discretization of SVEs (e.g. Richard\net al. 2021), we use Euler-Maruyama methods for the afore-mentioned\ndiscretization. We then show how multilevel Markov chain Monte Carlo (MCMC)\nmethods (Jasra et al. 2018) can be applied in this context. In the examples we\nstudy, we give a proof that shows that the cost to achieve a mean square error\n(MSE) of $\\mathcal{O}(\\epsilon^2)$, $\\epsilon>0$, is\n$\\mathcal{O}(\\epsilon^{-20/9})$. If one uses a single level MCMC method then\nthe cost is $\\mathcal{O}(\\epsilon^{-38/9})$ to achieve the same MSE. 
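[Editor's note] The Bayesian Volterra abstract above relies on Euler-Maruyama time-discretization before running (multilevel) MCMC. As background only, a minimal Euler-Maruyama sketch for a plain one-dimensional SDE dX = a(X)dt + b(X)dW; a genuine stochastic Volterra equation adds a memory kernel inside the integrals, which is not shown. The Ornstein-Uhlenbeck drift and constant diffusion are illustrative choices.

import numpy as np

def euler_maruyama(a, b, x0, T, n_steps, rng):
    """Simulate dX = a(X)dt + b(X)dW on [0, T] with n_steps Euler steps."""
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt))           # Brownian increment
        x[k + 1] = x[k] + a(x[k]) * dt + b(x[k]) * dw
    return x

rng = np.random.default_rng(2)
path = euler_maruyama(a=lambda x: -0.5 * x,          # mean-reverting drift
                      b=lambda x: 0.3,               # constant diffusion
                      x0=1.0, T=1.0, n_steps=1000, rng=rng)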
We\nillustrate these results in the context of state-space and stochastic\nvolatility models, with the latter applied to real data."}, "http://arxiv.org/abs/2310.03164": {"title": "A Hierarchical Random Effects State-space Model for Modeling Brain Activities from Electroencephalogram Data", "link": "http://arxiv.org/abs/2310.03164", "description": "Mental disorders present challenges in diagnosis and treatment due to their\ncomplex and heterogeneous nature. Electroencephalogram (EEG) has shown promise\nas a potential biomarker for these disorders. However, existing methods for\nanalyzing EEG signals have limitations in addressing heterogeneity and\ncapturing complex brain activity patterns between regions. This paper proposes\na novel random effects state-space model (RESSM) for analyzing large-scale\nmulti-channel resting-state EEG signals, accounting for the heterogeneity of\nbrain connectivities between groups and individual subjects. We incorporate\nmulti-level random effects for temporal dynamical and spatial mapping matrices\nand address nonstationarity so that the brain connectivity patterns can vary\nover time. The model is fitted under a Bayesian hierarchical model framework\ncoupled with a Gibbs sampler. Compared to previous mixed-effects state-space\nmodels, we directly model high-dimensional random effects matrices without\nstructural constraints and tackle the challenge of identifiability. Through\nextensive simulation studies, we demonstrate that our approach yields valid\nestimation and inference. We apply RESSM to a multi-site clinical trial of\nMajor Depressive Disorder (MDD). Our analysis uncovers significant differences\nin resting-state brain temporal dynamics among MDD patients compared to healthy\nindividuals. In addition, we show the subject-level EEG features derived from\nRESSM exhibit a superior predictive value for the heterogeneous treatment\neffect compared to the EEG frequency band power, suggesting the potential of\nEEG as a valuable biomarker for MDD."}, "http://arxiv.org/abs/2310.03258": {"title": "Detecting Electricity Service Equity Issues with Transfer Counterfactual Learning on Large-Scale Outage Datasets", "link": "http://arxiv.org/abs/2310.03258", "description": "Energy justice is a growing area of interest in interdisciplinary energy\nresearch. However, identifying systematic biases in the energy sector remains\nchallenging due to confounding variables, intricate heterogeneity in treatment\neffects, and limited data availability. To address these challenges, we\nintroduce a novel approach for counterfactual causal analysis centered on\nenergy justice. We use subgroup analysis to manage diverse factors and leverage\nthe idea of transfer learning to mitigate data scarcity in each subgroup. In\nour numerical analysis, we apply our method to a large-scale customer-level\npower outage data set and investigate the counterfactual effect of demographic\nfactors, such as income and age of the population, on power outage durations.\nOur results indicate that low-income and elderly-populated areas consistently\nexperience longer power outages, regardless of weather conditions. 
This points\nto existing biases in the power system and highlights the need for focused\nimprovements in areas with economic challenges."}, "http://arxiv.org/abs/2310.03351": {"title": "Efficiently analyzing large patient registries with Bayesian joint models for longitudinal and time-to-event data", "link": "http://arxiv.org/abs/2310.03351", "description": "The joint modeling of longitudinal and time-to-event outcomes has become a\npopular tool in follow-up studies. However, fitting Bayesian joint models to\nlarge datasets, such as patient registries, can require extended computing\ntimes. To speed up sampling, we divided a patient registry dataset into\nsubsamples, analyzed them in parallel, and combined the resulting Markov chain\nMonte Carlo draws into a consensus distribution. We used a simulation study to\ninvestigate how different consensus strategies perform with joint models. In\nparticular, we compared grouping all draws together with using equal- and\nprecision-weighted averages. We considered scenarios reflecting different\nsample sizes, numbers of data splits, and processor characteristics.\nParallelization of the sampling process substantially decreased the time\nrequired to run the model. We found that the weighted-average consensus\ndistributions for large sample sizes were nearly identical to the target\nposterior distribution. The proposed algorithm has been made available in an R\npackage for joint models, JMbayes2. This work was motivated by the clinical\ninterest in investigating the association between ppFEV1, a commonly measured\nmarker of lung function, and the risk of lung transplant or death, using data\nfrom the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals\nwith 372,366 years of cumulative follow-up). Splitting the registry into five\nsubsamples resulted in an 85\\% decrease in computing time, from 9.22 to 1.39\nhours. Splitting the data and finding a consensus distribution by\nprecision-weighted averaging proved to be a computationally efficient and\nrobust approach to handling large datasets under the joint modeling framework."}, "http://arxiv.org/abs/2310.03521": {"title": "Cutting Feedback in Misspecified Copula Models", "link": "http://arxiv.org/abs/2310.03521", "description": "In copula models the marginal distributions and copula function are specified\nseparately. We treat these as two modules in a modular Bayesian inference\nframework, and propose conducting modified Bayesian inference by ``cutting\nfeedback''. Cutting feedback limits the influence of potentially misspecified\nmodules in posterior inference. We consider two types of cuts. The first limits\nthe influence of a misspecified copula on inference for the marginals, which is\na Bayesian analogue of the popular Inference for Margins (IFM) estimator. The\nsecond limits the influence of misspecified marginals on inference for the\ncopula parameters by using a rank likelihood to define the cut model. We\nestablish that if only one of the modules is misspecified, then the appropriate\ncut posterior gives accurate uncertainty quantification asymptotically for the\nparameters in the other module. Computation of the cut posteriors is difficult,\nand new variational inference methods to do so are proposed. The efficacy of\nthe new methodology is demonstrated using both simulated data and a substantive\nmultivariate time series application from macroeconomic forecasting. 
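[Editor's note] The first "cut" in the copula abstract above is described as a Bayesian analogue of the classical Inference for Margins (IFM) estimator. For orientation, a minimal frequentist IFM-style sketch for a bivariate Gaussian copula with normal margins, fitted in two stages (margins first, then the copula on the probability-integral transforms). This shows only the two-stage idea, not the paper's cut posteriors or variational methods, and the normal margins are an illustrative assumption.

import numpy as np
from scipy import stats

def ifm_gaussian_copula(data):
    """Two-stage (IFM-style) fit: normal margins, then Gaussian copula correlation.

    data: (n, 2) array of raw observations.
    Returns the fitted marginal (loc, scale) pairs and the copula correlation.
    """
    # Stage 1: fit each margin separately by maximum likelihood.
    margins = [stats.norm.fit(data[:, j]) for j in range(2)]
    # Stage 2: probability-integral transform with the fitted margins,
    # map to normal scores, and estimate the Gaussian copula correlation.
    U = np.column_stack([stats.norm.cdf(data[:, j], *margins[j]) for j in range(2)])
    Z = stats.norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))
    rho = np.corrcoef(Z, rowvar=False)[0, 1]
    return margins, rho

rng = np.random.default_rng(10)
raw = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=500)
margins, rho = ifm_gaussian_copula(raw)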
In the\nlatter, cutting feedback from misspecified marginals to a 1096 dimension copula\nimproves posterior inference and predictive accuracy greatly, compared to\nconventional Bayesian inference."}, "http://arxiv.org/abs/2310.03630": {"title": "Model-based Clustering for Network Data via a Latent Shrinkage Position Cluster Model", "link": "http://arxiv.org/abs/2310.03630", "description": "Low-dimensional representation and clustering of network data are tasks of\ngreat interest across various fields. Latent position models are routinely used\nfor this purpose by assuming that each node has a location in a low-dimensional\nlatent space, and enabling node clustering. However, these models fall short in\nsimultaneously determining the optimal latent space dimension and the number of\nclusters. Here we introduce the latent shrinkage position cluster model\n(LSPCM), which addresses this limitation. The LSPCM posits a Bayesian\nnonparametric shrinkage prior on the latent positions' variance parameters\nresulting in higher dimensions having increasingly smaller variances, aiding in\nthe identification of dimensions with non-negligible variance. Further, the\nLSPCM assumes the latent positions follow a sparse finite Gaussian mixture\nmodel, allowing for automatic inference on the number of clusters related to\nnon-empty mixture components. As a result, the LSPCM simultaneously infers the\nlatent space dimensionality and the number of clusters, eliminating the need to\nfit and compare multiple models. The performance of the LSPCM is assessed via\nsimulation studies and demonstrated through application to two real Twitter\nnetwork datasets from sporting and political contexts. Open source software is\navailable to promote widespread use of the LSPCM."}, "http://arxiv.org/abs/2310.03722": {"title": "Anytime-valid t-tests and confidence sequences for Gaussian means with unknown variance", "link": "http://arxiv.org/abs/2310.03722", "description": "In 1976, Lai constructed a nontrivial confidence sequence for the mean $\\mu$\nof a Gaussian distribution with unknown variance $\\sigma$. Curiously, he\nemployed both an improper (right Haar) mixture over $\\sigma$ and an improper\n(flat) mixture over $\\mu$. Here, we elaborate carefully on the details of his\nconstruction, which use generalized nonintegrable martingales and an extended\nVille's inequality. While this does yield a sequential t-test, it does not\nyield an ``e-process'' (due to the nonintegrability of his martingale). In this\npaper, we develop two new e-processes and confidence sequences for the same\nsetting: one is a test martingale in a reduced filtration, while the other is\nan e-process in the canonical data filtration. These are respectively obtained\nby swapping Lai's flat mixture for a Gaussian mixture, and swapping the right\nHaar mixture over $\\sigma$ with the maximum likelihood estimate under the null,\nas done in universal inference. We also analyze the width of resulting\nconfidence sequences, which have a curious dependence on the error probability\n$\\alpha$. 
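[Editor's note] To make the "swapping Lai's flat mixture for a Gaussian mixture" step above concrete, a minimal sketch of a Gaussian-mixture likelihood-ratio test martingale for H0: mu = 0 when the variance is known. The paper's setting has unknown variance and uses a reduced filtration / universal-inference construction, which this simplification does not capture; the prior scale rho is an illustrative tuning choice.

import numpy as np

def gaussian_mixture_martingale(x, sigma=1.0, rho=1.0):
    """Running value of the Gaussian-mixture likelihood-ratio martingale.

    Tests H0: x_i i.i.d. N(0, sigma^2), mixing the alternative mean over
    N(0, rho^2).  By Ville's inequality, rejecting whenever the value exceeds
    1/alpha gives a level-alpha sequential test.  sigma is assumed known here.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(1, len(x) + 1)
    s = np.cumsum(x)                                  # partial sums S_t
    var_term = sigma**2 + t * rho**2
    return np.sqrt(sigma**2 / var_term) * np.exp(
        rho**2 * s**2 / (2 * sigma**2 * var_term))

rng = np.random.default_rng(3)
m = gaussian_mixture_martingale(rng.normal(0.3, 1.0, size=200))
print(m[-1] > 20)   # level-0.05 sequential rejection check (1/alpha = 20)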
Numerical experiments are provided along the way to compare and\ncontrast the various approaches."}, "http://arxiv.org/abs/2103.10875": {"title": "Scalable Bayesian computation for crossed and nested hierarchical models", "link": "http://arxiv.org/abs/2103.10875", "description": "We develop sampling algorithms to fit Bayesian hierarchical models, the\ncomputational complexity of which scales linearly with the number of\nobservations and the number of parameters in the model. We focus on crossed\nrandom effect and nested multilevel models, which are used ubiquitously in\napplied sciences. The posterior dependence in both classes is sparse: in\ncrossed random effects models it resembles a random graph, whereas in nested\nmultilevel models it is tree-structured. For each class we identify a framework\nfor scalable computation, building on previous work. Methods for crossed models\nare based on extensions of appropriately designed collapsed Gibbs samplers,\nwhere we introduce the idea of local centering; while methods for nested models\nare based on sparse linear algebra and data augmentation. We provide a\ntheoretical analysis of the proposed algorithms in some simplified settings,\nincluding a comparison with previously proposed methodologies and an\naverage-case analysis based on random graph theory. Numerical experiments,\nincluding two challenging real data analyses on predicting electoral results\nand real estate prices, compare with off-the-shelf Hamiltonian Monte Carlo,\ndisplaying drastic improvement in performance."}, "http://arxiv.org/abs/2106.04106": {"title": "A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance", "link": "http://arxiv.org/abs/2106.04106", "description": "Genome-wide association studies (GWAS) have identified thousands of genetic\nvariants associated with complex traits, and some variants are shown to be\nassociated with multiple complex traits. Genetic covariance between two traits\nis defined as the underlying covariance of genetic effects and can be used to\nmeasure the shared genetic architecture. The data used to estimate such a\ngenetic covariance can be from the same group or different groups of\nindividuals, and the traits can be of different types or collected based on\ndifferent study designs. This paper proposes a unified regression-based\napproach to robust estimation and inference for genetic covariance of general\ntraits that may be associated with genetic variants nonlinearly. The asymptotic\nproperties of the proposed estimator are provided and are shown to be robust\nunder certain model mis-specification. Our method under linear working models\nprovides a robust inference for the narrow-sense genetic covariance, even when\nboth linear models are mis-specified. Numerical experiments are performed to\nsupport the theoretical results. Our method is applied to an outbred mice GWAS\ndata set to study the overlapping genetic effects between the behavioral and\nphysiological phenotypes. The real data results reveal interesting genetic\ncovariance among different mice developmental traits."}, "http://arxiv.org/abs/2112.08417": {"title": "Characterization of causal ancestral graphs for time series with latent confounders", "link": "http://arxiv.org/abs/2112.08417", "description": "In this paper, we introduce a novel class of graphical models for\nrepresenting time lag specific causal relationships and independencies of\nmultivariate time series with unobserved confounders. 
We completely\ncharacterize these graphs and show that they constitute proper subsets of the\ncurrently employed model classes. As we show, from the novel graphs one can\nthus draw stronger causal inferences -- without additional assumptions. We\nfurther introduce a graphical representation of Markov equivalence classes of\nthe novel graphs. This graphical representation contains more causal knowledge\nthan what current state-of-the-art causal discovery algorithms learn."}, "http://arxiv.org/abs/2112.09313": {"title": "Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects", "link": "http://arxiv.org/abs/2112.09313", "description": "Federated learning of causal estimands may greatly improve estimation\nefficiency by leveraging data from multiple study sites, but robustness to\nheterogeneity and model misspecifications is vital for ensuring validity. We\ndevelop a Federated Adaptive Causal Estimation (FACE) framework to incorporate\nheterogeneous data from multiple sites to provide treatment effect estimation\nand inference for a flexibly specified target population of interest. FACE\naccounts for site-level heterogeneity in the distribution of covariates through\ndensity ratio weighting. To safely incorporate source sites and avoid negative\ntransfer, we introduce an adaptive weighting procedure via a penalized\nregression, which achieves both consistency and optimal efficiency. Our\nstrategy is communication-efficient and privacy-preserving, allowing\nparticipating sites to share summary statistics only once with other sites. We\nconduct both theoretical and numerical evaluations of FACE and apply it to\nconduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273\n(Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic\nhealth records from five VA regional sites. We show that compared to\ntraditional methods, FACE meaningfully increases the precision of treatment\neffect estimates, with reductions in standard errors ranging from $26\\%$ to\n$67\\%$."}, "http://arxiv.org/abs/2208.03246": {"title": "Non-Asymptotic Analysis of Ensemble Kalman Updates: Effective Dimension and Localization", "link": "http://arxiv.org/abs/2208.03246", "description": "Many modern algorithms for inverse problems and data assimilation rely on\nensemble Kalman updates to blend prior predictions with observed data. Ensemble\nKalman methods often perform well with a small ensemble size, which is\nessential in applications where generating each particle is costly. This paper\ndevelops a non-asymptotic analysis of ensemble Kalman updates that rigorously\nexplains why a small ensemble size suffices if the prior covariance has\nmoderate effective dimension due to fast spectrum decay or approximate\nsparsity. We present our theory in a unified framework, comparing several\nimplementations of ensemble Kalman updates that use perturbed observations,\nsquare root filtering, and localization. 
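[Editor's note] A minimal sketch of the perturbed-observation ensemble Kalman update analyzed in the abstract above, for a linear observation operator; the square-root and localized variants mentioned there are not shown, and the linear-Gaussian toy setup is an illustrative assumption.

import numpy as np

def enkf_update(X, y, H, R, rng):
    """Perturbed-observation EnKF update.

    X: (n_ens, d) prior ensemble; y: (p,) observation;
    H: (p, d) linear observation operator; R: (p, p) observation noise covariance.
    """
    n_ens = X.shape[0]
    A = X - X.mean(axis=0)                      # state anomalies, (n_ens, d)
    HX = X @ H.T                                # predicted observations, (n_ens, p)
    HA = HX - HX.mean(axis=0)                   # observation anomalies
    P_xy = A.T @ HA / (n_ens - 1)               # cross-covariance, (d, p)
    P_yy = HA.T @ HA / (n_ens - 1) + R          # innovation covariance, (p, p)
    K = np.linalg.solve(P_yy, P_xy.T).T         # Kalman gain, (d, p)
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=n_ens)
    return X + (y + eps - HX) @ K.T             # updated ensemble

rng = np.random.default_rng(9)
d, p, n_ens = 5, 2, 40
H = rng.normal(size=(p, d))
R = 0.1 * np.eye(p)
X_prior = rng.normal(size=(n_ens, d))
y_obs = H @ rng.normal(size=d) + rng.multivariate_normal(np.zeros(p), R)
X_post = enkf_update(X_prior, y_obs, H, R, rng)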
As part of our analysis, we develop\nnew dimension-free covariance estimation bounds for approximately sparse\nmatrices that may be of independent interest."}, "http://arxiv.org/abs/2307.10972": {"title": "Adaptively Weighted Audits of Instant-Runoff Voting Elections: AWAIRE", "link": "http://arxiv.org/abs/2307.10972", "description": "An election audit is risk-limiting if the audit limits (to a pre-specified\nthreshold) the chance that an erroneous electoral outcome will be certified.\nExtant methods for auditing instant-runoff voting (IRV) elections are either\nnot risk-limiting or require cast vote records (CVRs), the voting system's\nelectronic record of the votes on each ballot. CVRs are not always available,\nfor instance, in jurisdictions that tabulate IRV contests manually.\n\nWe develop an RLA method (AWAIRE) that uses adaptively weighted averages of\ntest supermartingales to efficiently audit IRV elections when CVRs are not\navailable. The adaptive weighting 'learns' an efficient set of hypotheses to\ntest to confirm the election outcome. When accurate CVRs are available, AWAIRE\ncan use them to increase the efficiency to match the performance of existing\nmethods that require CVRs.\n\nWe provide an open-source prototype implementation that can handle elections\nwith up to six candidates. Simulations using data from real elections show that\nAWAIRE is likely to be efficient in practice. We discuss how to extend the\ncomputational approach to handle elections with more candidates.\n\nAdaptively weighted averages of test supermartingales are a general tool,\nuseful beyond election audits to test collections of hypotheses sequentially\nwhile rigorously controlling the familywise error rate."}, "http://arxiv.org/abs/2309.10514": {"title": "Partially Specified Causal Simulations", "link": "http://arxiv.org/abs/2309.10514", "description": "Simulation studies play a key role in the validation of causal inference\nmethods. The simulation results are reliable only if the study is designed\naccording to the promised operational conditions of the method-in-test. Still,\nmany causal inference literature tend to design over-restricted or misspecified\nstudies. In this paper, we elaborate on the problem of improper simulation\ndesign for causal methods and compile a list of desiderata for an effective\nsimulation framework. We then introduce partially randomized causal simulation\n(PARCS), a simulation framework that meets those desiderata. PARCS synthesizes\ndata based on graphical causal models and a wide range of adjustable\nparameters. There is a legible mapping from usual causal assumptions to the\nparameters, thus, users can identify and specify the subset of related\nparameters and randomize the remaining ones to generate a range of complying\ndata-generating processes for their causal method. The result is a more\ncomprehensive and inclusive empirical investigation for causal claims. Using\nPARCS, we reproduce and extend the simulation studies of two well-known causal\ndiscovery and missing data analysis papers to emphasize the necessity of a\nproper simulation design. Our results show that those papers would have\nimproved and extended the findings, had they used PARCS for simulation. The\nframework is implemented as a Python package, too. 
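[Editor's note] The PARCS abstract above describes a framework that synthesizes data from graphical causal models with adjustable parameters. Its actual Python API is not reproduced here; instead, a generic sketch of the underlying idea, sampling from a tiny user-specified DAG whose edge coefficients play the role of the "adjustable parameters" (the node names and the linear-Gaussian form are illustrative assumptions):

import numpy as np

def simulate_dag(n, coeffs, rng):
    """Sample n rows from the toy DAG Z -> X -> Y <- Z with linear-Gaussian edges.

    coeffs: dict of edge coefficients, e.g. {"Z->X": 0.8, "Z->Y": 0.5, "X->Y": 1.2}.
    """
    Z = rng.normal(size=n)
    X = coeffs["Z->X"] * Z + rng.normal(size=n)
    Y = coeffs["Z->Y"] * Z + coeffs["X->Y"] * X + rng.normal(size=n)
    return np.column_stack([Z, X, Y])

rng = np.random.default_rng(4)
data = simulate_dag(1000, {"Z->X": 0.8, "Z->Y": 0.5, "X->Y": 1.2}, rng)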
By discussing the\ncomprehensiveness and transparency of PARCS, we encourage causal inference\nresearchers to utilize it as a standard tool for future works."}, "http://arxiv.org/abs/2310.03776": {"title": "Significance of the negative binomial distribution in multiplicity phenomena", "link": "http://arxiv.org/abs/2310.03776", "description": "The negative binomial distribution (NBD) has been theorized to express a\nscale-invariant property of many-body systems and has been consistently shown\nto outperform other statistical models in both describing the multiplicity of\nquantum-scale events in particle collision experiments and predicting the\nprevalence of cosmological observables, such as the number of galaxies in a\nregion of space. Despite its widespread applicability and empirical success in\nthese contexts, a theoretical justification for the NBD from first principles\nhas remained elusive for fifty years. The accuracy of the NBD in modeling\nhadronic, leptonic, and semileptonic processes is suggestive of a highly\ngeneral principle, which is yet to be understood. This study demonstrates that\na statistical event of the NBD can in fact be derived in a general context via\nthe dynamical equations of a canonical ensemble of particles in Minkowski\nspace. These results describe a fundamental feature of many-body systems that\nis consistent with data from the ALICE and ATLAS experiments and provides an\nexplanation for the emergence of the NBD in these multiplicity observations.\nTwo methods are used to derive this correspondence: the Feynman path integral\nand a hypersurface parametrization of a propagating ensemble."}, "http://arxiv.org/abs/2310.04030": {"title": "Robust inference with GhostKnockoffs in genome-wide association studies", "link": "http://arxiv.org/abs/2310.04030", "description": "Genome-wide association studies (GWASs) have been extensively adopted to\ndepict the underlying genetic architecture of complex diseases. Motivated by\nGWASs' limitations in identifying small effect loci to understand complex\ntraits' polygenicity and fine-mapping putative causal variants from proxy ones,\nwe propose a knockoff-based method which only requires summary statistics from\nGWASs and demonstrate its validity in the presence of relatedness. We show that\nGhostKnockoffs inference is robust to its input Z-scores as long as they are\nfrom valid marginal association tests and their correlations are consistent\nwith the correlations among the corresponding genetic variants. The property\ngeneralizes GhostKnockoffs to other GWASs settings, such as the meta-analysis\nof multiple overlapping studies and studies based on association test\nstatistics deviated from score tests. We demonstrate GhostKnockoffs'\nperformance using empirical simulation and a meta-analysis of nine European\nancestral genome-wide association studies and whole exome/genome sequencing\nstudies. Both results demonstrate that GhostKnockoffs identify more putative\ncausal variants with weak genotype-phenotype associations that are missed by\nconventional GWASs."}, "http://arxiv.org/abs/2310.04082": {"title": "An energy-based model approach to rare event probability estimation", "link": "http://arxiv.org/abs/2310.04082", "description": "The estimation of rare event probabilities plays a pivotal role in diverse\nfields. Our aim is to determine the probability of a hazard or system failure\noccurring when a quantity of interest exceeds a critical value. 
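[Editor's note] To contrast with the energy-based approach described above, a minimal rare-event baseline: plain Monte Carlo versus mean-shifted (exponentially tilted) importance sampling for P(Q > q_crit) with a standard normal Q. This is generic importance sampling for illustration only, not the paper's bias-potential/EBM construction; the threshold and sample size are placeholders.

import numpy as np
from scipy import stats

def rare_event_estimates(q_crit=4.0, n=100_000, seed=5):
    """Estimate P(Q > q_crit) for Q ~ N(0, 1) two ways."""
    rng = np.random.default_rng(seed)
    # Plain Monte Carlo: almost every draw misses the failure region.
    q = rng.normal(size=n)
    p_mc = np.mean(q > q_crit)
    # Importance sampling with a proposal shifted onto the failure threshold.
    q_is = rng.normal(loc=q_crit, size=n)
    weights = stats.norm.pdf(q_is) / stats.norm.pdf(q_is, loc=q_crit)
    p_is = np.mean((q_is > q_crit) * weights)
    return p_mc, p_is, stats.norm.sf(q_crit)   # exact tail for reference

print(rare_event_estimates())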
In our\napproach, the distribution of the quantity of interest is represented by an\nenergy density, characterized by a free energy function. To efficiently\nestimate the free energy, a bias potential is introduced. Using concepts from\nenergy-based models (EBM), this bias potential is optimized such that the\ncorresponding probability density function approximates a pre-defined\ndistribution targeting the failure region of interest. Given the optimal bias\npotential, the free energy function and the rare event probability of interest\ncan be determined. The approach is applicable not just in traditional rare\nevent settings where the variable upon which the quantity of interest relies\nhas a known distribution, but also in inversion settings where the variable\nfollows a posterior distribution. By combining the EBM approach with a Stein\ndiscrepancy-based stopping criterion, we aim for a balanced accuracy-efficiency\ntrade-off. Furthermore, we explore both parametric and non-parametric\napproaches for the bias potential, with the latter eliminating the need for\nchoosing a particular parameterization, but depending strongly on the accuracy\nof the kernel density estimate used in the optimization process. Through three\nillustrative test cases encompassing both traditional and inversion settings,\nwe show that the proposed EBM approach, when properly configured, (i) allows\nstable and efficient estimation of rare event probabilities and (ii) compares\nfavorably against subset sampling approaches."}, "http://arxiv.org/abs/2310.04165": {"title": "When Composite Likelihood Meets Stochastic Approximation", "link": "http://arxiv.org/abs/2310.04165", "description": "A composite likelihood is an inference function derived by multiplying a set\nof likelihood components. This approach provides a flexible framework for\ndrawing inference when the likelihood function of a statistical model is\ncomputationally intractable. While composite likelihood has computational\nadvantages, it can still be demanding when dealing with numerous likelihood\ncomponents and a large sample size. This paper tackles this challenge by\nemploying an approximation of the conventional composite likelihood estimator,\nwhich is derived from an optimization procedure relying on stochastic\ngradients. This novel estimator is shown to be asymptotically normally\ndistributed around the true parameter. In particular, based on the relative\ndivergent rate of the sample size and the number of iterations of the\noptimization, the variance of the limiting distribution is shown to compound\nfor two sources of uncertainty: the sampling variability of the data and the\noptimization noise, with the latter depending on the sampling distribution used\nto construct the stochastic gradients. The advantages of the proposed framework\nare illustrated through simulation studies on two working examples: an Ising\nmodel for binary data and a gamma frailty model for count data. Finally, a\nreal-data application is presented, showing its effectiveness in a large-scale\nmental health survey."}, "http://arxiv.org/abs/1904.06340": {"title": "A Composite Likelihood-based Approach for Change-point Detection in Spatio-temporal Processes", "link": "http://arxiv.org/abs/1904.06340", "description": "This paper develops a unified and computationally efficient method for\nchange-point estimation along the time dimension in a non-stationary\nspatio-temporal process. 
By modeling a non-stationary spatio-temporal process\nas a piecewise stationary spatio-temporal process, we consider simultaneous\nestimation of the number and locations of change-points, and model parameters\nin each segment. A composite likelihood-based criterion is developed for\nchange-point and parameters estimation. Under the framework of increasing\ndomain asymptotics, theoretical results including consistency and distribution\nof the estimators are derived under mild conditions. In contrast to classical\nresults in fixed dimensional time series that the localization error of\nchange-point estimator is $O_{p}(1)$, exact recovery of true change-points can\nbe achieved in the spatio-temporal setting. More surprisingly, the consistency\nof change-point estimation can be achieved without any penalty term in the\ncriterion function. In addition, we further establish consistency of the number\nand locations of the change-point estimator under the infill asymptotics\nframework where the time domain is increasing while the spatial sampling domain\nis fixed. A computationally efficient pruned dynamic programming algorithm is\ndeveloped for the challenging criterion optimization problem. Extensive\nsimulation studies and an application to U.S. precipitation data are provided\nto demonstrate the effectiveness and practicality of the proposed method."}, "http://arxiv.org/abs/2201.12936": {"title": "Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective", "link": "http://arxiv.org/abs/2201.12936", "description": "Practitioners and academics have long appreciated the benefits of covariate\nbalancing when they conduct randomized experiments. For web-facing firms\nrunning online A/B tests, however, it still remains challenging in balancing\ncovariate information when experimental subjects arrive sequentially. In this\npaper, we study an online experimental design problem, which we refer to as the\n\"Online Blocking Problem.\" In this problem, experimental subjects with\nheterogeneous covariate information arrive sequentially and must be immediately\nassigned into either the control or the treated group. The objective is to\nminimize the total discrepancy, which is defined as the minimum weight perfect\nmatching between the two groups. To solve this problem, we propose a randomized\ndesign of experiment, which we refer to as the \"Pigeonhole Design.\" The\npigeonhole design first partitions the covariate space into smaller spaces,\nwhich we refer to as pigeonholes, and then, when the experimental subjects\narrive at each pigeonhole, balances the number of control and treated subjects\nfor each pigeonhole. We analyze the theoretical performance of the pigeonhole\ndesign and show its effectiveness by comparing against two well-known benchmark\ndesigns: the match-pair design and the completely randomized design. We\nidentify scenarios when the pigeonhole design demonstrates more benefits over\nthe benchmark design. To conclude, we conduct extensive simulations using\nYahoo! data to show a 10.2% reduction in variance if we use the pigeonhole\ndesign to estimate the average treatment effect."}, "http://arxiv.org/abs/2208.00137": {"title": "Efficient estimation and inference for the signed $\\beta$-model in directed signed networks", "link": "http://arxiv.org/abs/2208.00137", "description": "This paper proposes a novel signed $\\beta$-model for directed signed network,\nwhich is frequently encountered in application domains but largely neglected in\nliterature. 
The proposed signed $\\beta$-model decomposes a directed signed\nnetwork as the difference of two unsigned networks and embeds each node with\ntwo latent factors for in-status and out-status. The presence of negative edges\nleads to a non-concave log-likelihood, and a one-step estimation algorithm is\ndeveloped to facilitate parameter estimation, which is efficient both\ntheoretically and computationally. We also develop an inferential procedure for\npairwise and multiple node comparisons under the signed $\\beta$-model, which\nfills the void of lacking uncertainty quantification for node ranking.\nTheoretical results are established for the coverage probability of confidence\ninterval, as well as the false discovery rate (FDR) control for multiple node\ncomparison. The finite sample performance of the signed $\\beta$-model is also\nexamined through extensive numerical experiments on both synthetic and\nreal-life networks."}, "http://arxiv.org/abs/2208.08401": {"title": "Conformal Inference for Online Prediction with Arbitrary Distribution Shifts", "link": "http://arxiv.org/abs/2208.08401", "description": "We consider the problem of forming prediction sets in an online setting where\nthe distribution generating the data is allowed to vary over time. Previous\napproaches to this problem suffer from over-weighting historical data and thus\nmay fail to quickly react to the underlying dynamics. Here we correct this\nissue and develop a novel procedure with provably small regret over all local\ntime intervals of a given width. We achieve this by modifying the adaptive\nconformal inference (ACI) algorithm of Gibbs and Cand\\`{e}s (2021) to contain\nan additional step in which the step-size parameter of ACI's gradient descent\nupdate is tuned over time. Crucially, this means that unlike ACI, which\nrequires knowledge of the rate of change of the data-generating mechanism, our\nnew procedure is adaptive to both the size and type of the distribution shift.\nOur methods are highly flexible and can be used in combination with any\nbaseline predictive algorithm that produces point estimates or estimated\nquantiles of the target without the need for distributional assumptions. We\ntest our techniques on two real-world datasets aimed at predicting stock market\nvolatility and COVID-19 case counts and find that they are robust and adaptive\nto real-world distribution shifts."}, "http://arxiv.org/abs/2303.01031": {"title": "Identifiability and Consistent Estimation of the Gaussian Chain Graph Model", "link": "http://arxiv.org/abs/2303.01031", "description": "The chain graph model admits both undirected and directed edges in one graph,\nwhere symmetric conditional dependencies are encoded via undirected edges and\nasymmetric causal relations are encoded via directed edges. Though frequently\nencountered in practice, the chain graph model has been largely under\ninvestigated in literature, possibly due to the lack of identifiability\nconditions between undirected and directed edges. In this paper, we first\nestablish a set of novel identifiability conditions for the Gaussian chain\ngraph model, exploiting a low rank plus sparse decomposition of the precision\nmatrix. Further, an efficient learning algorithm is built upon the\nidentifiability conditions to fully recover the chain graph structure.\nTheoretical analysis on the proposed method is conducted, assuring its\nasymptotic consistency in recovering the exact chain graph structure. 
The\nadvantage of the proposed method is also supported by numerical experiments on\nboth simulated examples and a real application on the Standard & Poor 500 index\ndata."}, "http://arxiv.org/abs/2305.10817": {"title": "Robust inference of causality in high-dimensional dynamical processes from the Information Imbalance of distance ranks", "link": "http://arxiv.org/abs/2305.10817", "description": "We introduce an approach which allows detecting causal relationships between\nvariables for which the time evolution is available. Causality is assessed by a\nvariational scheme based on the Information Imbalance of distance ranks, a\nstatistical test capable of inferring the relative information content of\ndifferent distance measures. We test whether the predictability of a putative\ndriven system Y can be improved by incorporating information from a potential\ndriver system X, without making assumptions on the underlying dynamics and\nwithout the need to compute probability densities of the dynamic variables.\nThis framework makes causality detection possible even for high-dimensional\nsystems where only few of the variables are known or measured. Benchmark tests\non coupled chaotic dynamical systems demonstrate that our approach outperforms\nother model-free causality detection methods, successfully handling both\nunidirectional and bidirectional couplings. We also show that the method can be\nused to robustly detect causality in human electroencephalography data."}, "http://arxiv.org/abs/2309.06264": {"title": "Spectral clustering algorithm for the allometric extension model", "link": "http://arxiv.org/abs/2309.06264", "description": "The spectral clustering algorithm is often used as a binary clustering method\nfor unclassified data by applying the principal component analysis. To study\ntheoretical properties of the algorithm, the assumption of conditional\nhomoscedasticity is often supposed in existing studies. However, this\nassumption is restrictive and often unrealistic in practice. Therefore, in this\npaper, we consider the allometric extension model, that is, the directions of\nthe first eigenvectors of two covariance matrices and the direction of the\ndifference of two mean vectors coincide, and we provide a non-asymptotic bound\nof the error probability of the spectral clustering algorithm for the\nallometric extension model. As a byproduct of the result, we obtain the\nconsistency of the clustering method in high-dimensional settings."}, "http://arxiv.org/abs/2309.12833": {"title": "Model-based causal feature selection for general response types", "link": "http://arxiv.org/abs/2309.12833", "description": "Discovering causal relationships from observational data is a fundamental yet\nchallenging task. Invariant causal prediction (ICP, Peters et al., 2016) is a\nmethod for causal feature selection which requires data from heterogeneous\nsettings and exploits that causal models are invariant. ICP has been extended\nto general additive noise models and to nonparametric settings using\nconditional independence tests. However, the latter often suffer from low power\n(or poor type I error control) and additive noise models are not suitable for\napplications in which the response is not measured on a continuous scale, but\nreflects categories or counts. 
Here, we develop transformation-model (TRAM)\nbased ICP, allowing for continuous, categorical, count-type, and\nuninformatively censored responses (these model classes, generally, do not\nallow for identifiability when there is no exogenous heterogeneity). As an\ninvariance test, we propose TRAM-GCM based on the expected conditional\ncovariance between environments and score residuals with uniform asymptotic\nlevel guarantees. For the special case of linear shift TRAMs, we also consider\nTRAM-Wald, which tests invariance based on the Wald statistic. We provide an\nopen-source R package 'tramicp' and evaluate our approach on simulated data and\nin a case study investigating causal features of survival in critically ill\npatients."}, "http://arxiv.org/abs/2310.04452": {"title": "Short text classification with machine learning in the social sciences: The case of climate change on Twitter", "link": "http://arxiv.org/abs/2310.04452", "description": "To analyse large numbers of texts, social science researchers are\nincreasingly confronting the challenge of text classification. When manual\nlabeling is not possible and researchers have to find automatized ways to\nclassify texts, computer science provides a useful toolbox of machine-learning\nmethods whose performance remains understudied in the social sciences. In this\narticle, we compare the performance of the most widely used text classifiers by\napplying them to a typical research scenario in social science research: a\nrelatively small labeled dataset with infrequent occurrence of categories of\ninterest, which is a part of a large unlabeled dataset. As an example case, we\nlook at Twitter communication regarding climate change, a topic of increasing\nscholarly interest in interdisciplinary social science research. Using a novel\ndataset including 5,750 tweets from various international organizations\nregarding the highly ambiguous concept of climate change, we evaluate the\nperformance of methods in automatically classifying tweets based on whether\nthey are about climate change or not. In this context, we highlight two main\nfindings. First, supervised machine-learning methods perform better than\nstate-of-the-art lexicons, in particular as class balance increases. Second,\ntraditional machine-learning methods, such as logistic regression and random\nforest, perform similarly to sophisticated deep-learning methods, whilst\nrequiring much less training time and computational resources. The results have\nimportant implications for the analysis of short texts in social science\nresearch."}, "http://arxiv.org/abs/2310.04563": {"title": "Modeling the Risk of In-Person Instruction during the COVID-19 Pandemic", "link": "http://arxiv.org/abs/2310.04563", "description": "During the COVID-19 pandemic, implementing in-person indoor instruction in a\nsafe manner was a high priority for universities nationwide. To support this\neffort at the University, we developed a mathematical model for estimating the\nrisk of SARS-CoV-2 transmission in university classrooms. This model was used\nto design a safe classroom environment at the University during the COVID-19\npandemic that supported the higher occupancy levels needed to match\npre-pandemic numbers of in-person courses, despite a limited number of large\nclassrooms. A retrospective analysis at the end of the semester confirmed the\nmodel's assessment that the proposed classroom configuration would be safe. 
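[Editor's note] The classroom-risk abstract above does not spell out its transmission model. As general background only, a minimal sketch of a Wells-Riley-style airborne transmission calculation of the kind commonly used for indoor-occupancy planning; this is not the University's model, and every parameter value below is a placeholder.

import math

def wells_riley_infection_prob(n_infectors, quanta_rate, breathing_rate,
                               duration_h, ventilation_rate):
    """Wells-Riley probability that a susceptible occupant is infected.

    quanta_rate: infectious quanta emitted per infector per hour.
    breathing_rate: susceptible occupant's inhalation rate (m^3 per hour).
    ventilation_rate: room's clean-air supply (m^3 per hour).
    """
    exposure = (n_infectors * quanta_rate * breathing_rate * duration_h
                / ventilation_rate)
    return 1.0 - math.exp(-exposure)

# Placeholder scenario: one infector, a 75-minute class, modest ventilation.
print(wells_riley_infection_prob(1, quanta_rate=25, breathing_rate=0.5,
                                 duration_h=1.25, ventilation_rate=500))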
Our\nframework is generalizable and was also used to support reopening decisions at\nStanford University. In addition, our methods are flexible; our modeling\nframework was repurposed to plan for large university events and gatherings. We\nfound that our approach and methods work in a wide range of indoor settings and\ncould be used to support reopening planning across various industries, from\nsecondary schools to movie theaters and restaurants."}, "http://arxiv.org/abs/2310.04578": {"title": "TNDDR: Efficient and doubly robust estimation of COVID-19 vaccine effectiveness under the test-negative design", "link": "http://arxiv.org/abs/2310.04578", "description": "While the test-negative design (TND), which is routinely used for monitoring\nseasonal flu vaccine effectiveness (VE), has recently become integral to\nCOVID-19 vaccine surveillance, it is susceptible to selection bias due to\noutcome-dependent sampling. Some studies have addressed the identifiability and\nestimation of causal parameters under the TND, but efficiency bounds for\nnonparametric estimators of the target parameter under the unconfoundedness\nassumption have not yet been investigated. We propose a one-step doubly robust\nand locally efficient estimator called TNDDR (TND doubly robust), which\nutilizes sample splitting and can incorporate machine learning techniques to\nestimate the nuisance functions. We derive the efficient influence function\n(EIF) for the marginal expectation of the outcome under a vaccination\nintervention, explore the von Mises expansion, and establish the conditions for\n$\\sqrt{n}-$consistency, asymptotic normality and double robustness of TNDDR.\nThe proposed TNDDR is supported by both theoretical and empirical\njustifications, and we apply it to estimate COVID-19 VE in an administrative\ndataset of community-dwelling older people (aged $\\geq 60$y) in the province of\nQu\\'ebec, Canada."}, "http://arxiv.org/abs/2310.04660": {"title": "Balancing Weights for Causal Inference in Observational Factorial Studies", "link": "http://arxiv.org/abs/2310.04660", "description": "Many scientific questions in biomedical, environmental, and psychological\nresearch involve understanding the impact of multiple factors on outcomes.\nWhile randomized factorial experiments are ideal for this purpose,\nrandomization is infeasible in many empirical studies. Therefore, investigators\noften rely on observational data, where drawing reliable causal inferences for\nmultiple factors remains challenging. As the number of treatment combinations\ngrows exponentially with the number of factors, some treatment combinations can\nbe rare or even missing by chance in observed data, further complicating\nfactorial effects estimation. To address these challenges, we propose a novel\nweighting method tailored to observational studies with multiple factors. Our\napproach uses weighted observational data to emulate a randomized factorial\nexperiment, enabling simultaneous estimation of the effects of multiple factors\nand their interactions. Our investigations reveal a crucial nuance: achieving\nbalance among covariates, as in single-factor scenarios, is necessary but\ninsufficient for unbiasedly estimating factorial effects. Our findings suggest\nthat balancing the factors is also essential in multi-factor settings.\nMoreover, we extend our weighting method to handle missing treatment\ncombinations in observed data. 
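[Editor's note] To make the "weighted observational data to emulate a randomized factorial experiment" idea above concrete, a minimal inverse-probability-weighting sketch for a 2x2 factorial: estimate the probability of each of the four treatment combinations with multinomial logistic regression and weight each unit by the inverse of its estimated probability. This is generic IPW for illustration, not the paper's proposed balancing-weights estimator, and the simulated data are an assumption.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 3))                        # covariates
# Two binary factors whose assignment depends on covariates (observational).
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
b = rng.binomial(1, 1 / (1 + np.exp(-X[:, 1])))
combo = 2 * a + b                                   # four treatment combinations
y = 1.0 * a + 0.5 * b + 0.3 * a * b + X.sum(axis=1) + rng.normal(size=n)

# Multinomial model for the combination probabilities given covariates.
ps_model = LogisticRegression(max_iter=1000).fit(X, combo)
probs = ps_model.predict_proba(X)[np.arange(n), combo]
w = 1.0 / probs                                     # inverse-probability weights

# Weighted cell means emulate the randomized 2x2 factorial.
cell_means = {c: np.average(y[combo == c], weights=w[combo == c]) for c in range(4)}
main_effect_a = (cell_means[2] + cell_means[3]) / 2 - (cell_means[0] + cell_means[1]) / 2
print(main_effect_a)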
Finally, we study the asymptotic behavior of the\nnew weighting estimators and propose a consistent variance estimator, providing\nreliable inferences on factorial effects in observational studies."}, "http://arxiv.org/abs/2310.04709": {"title": "Time-dependent mediators in survival analysis: Graphical representation of causal assumptions", "link": "http://arxiv.org/abs/2310.04709", "description": "We study time-dependent mediators in survival analysis using a treatment\nseparation approach due to Didelez [2019] and based on earlier work by Robins\nand Richardson [2011]. This approach avoids nested counterfactuals and\ncrossworld assumptions which are otherwise common in mediation analysis. The\ncausal model of treatment, mediators, covariates, confounders and outcome is\nrepresented by causal directed acyclic graphs (DAGs). However, the DAGs tend to\nbe very complex when we have measurements at a large number of time points. We\ntherefore suggest using so-called rolled graphs in which a node represents an\nentire coordinate process instead of a single random variable, leading us to\nfar simpler graphical representations. The rolled graphs are not necessarily\nacyclic; they can be analyzed by $\\delta$-separation which is the appropriate\ngraphical separation criterion in this class of graphs and analogous to\n$d$-separation. In particular, $\\delta$-separation is a graphical tool for\nevaluating if the conditions of the mediation analysis are met or if unmeasured\nconfounders influence the estimated effects. We also state a mediational\ng-formula. This is similar to the approach in Vansteelandt et al. [2019]\nalthough that paper has a different conceptual basis. Finally, we apply this\nframework to a statistical model based on a Cox model with an added treatment\neffect.survival analysis; mediation; causal inference; graphical models; local\nindependence graphs"}, "http://arxiv.org/abs/2310.04853": {"title": "On changepoint detection in functional data using empirical energy distance", "link": "http://arxiv.org/abs/2310.04853", "description": "We propose a novel family of test statistics to detect the presence of\nchangepoints in a sequence of dependent, possibly multivariate,\nfunctional-valued observations. Our approach allows to test for a very general\nclass of changepoints, including the \"classical\" case of changes in the mean,\nand even changes in the whole distribution. Our statistics are based on a\ngeneralisation of the empirical energy distance; we propose weighted\nfunctionals of the energy distance process, which are designed in order to\nenhance the ability to detect breaks occurring at sample endpoints. The\nlimiting distribution of the maximally selected version of our statistics\nrequires only the computation of the eigenvalues of the covariance function,\nthus being readily implementable in the most commonly employed packages, e.g.\nR. We show that, under the alternative, our statistics are able to detect\nchangepoints occurring even very close to the beginning/end of the sample. In\nthe presence of multiple changepoints, we propose a binary segmentation\nalgorithm to estimate the number of breaks and the locations thereof.\nSimulations show that our procedures work very well in finite samples. 
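[Editor's note] A minimal sketch of the binary-segmentation idea used in the changepoint abstract above, but with a much simpler statistic: a univariate CUSUM for changes in the mean rather than the paper's weighted energy-distance functional. The threshold and minimum segment length below are illustrative placeholders.

import numpy as np

def cusum_stat(x):
    """Max absolute CUSUM statistic and the corresponding split point."""
    n = len(x)
    total = x.sum()
    k = np.arange(1, n)                     # candidate split points 1..n-1
    partial = np.cumsum(x)[:-1]
    stat = np.abs(partial - k * total / n) / np.sqrt(n)
    j = int(np.argmax(stat))
    return stat[j], j + 1                   # statistic value, left-segment length

def binary_segmentation(x, threshold, min_len=10):
    """Recursively split the series wherever the CUSUM statistic exceeds threshold."""
    if len(x) < 2 * min_len:
        return []
    value, cp = cusum_stat(x)
    if value < threshold:
        return []
    left = binary_segmentation(x[:cp], threshold, min_len)
    right = [cp + c for c in binary_segmentation(x[cp:], threshold, min_len)]
    return sorted(left + [cp] + right)

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(1.5, 1, 200)])
print(binary_segmentation(x, threshold=3.0))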
We\ncomplement our theory with applications to financial and temperature data."}, "http://arxiv.org/abs/2310.04919": {"title": "The Conditional Prediction Function: A Novel Technique to Control False Discovery Rate for Complex Models", "link": "http://arxiv.org/abs/2310.04919", "description": "In modern scientific research, the objective is often to identify which\nvariables are associated with an outcome among a large class of potential\npredictors. This goal can be achieved by selecting variables in a manner that\ncontrols the the false discovery rate (FDR), the proportion of irrelevant\npredictors among the selections. Knockoff filtering is a cutting-edge approach\nto variable selection that provides FDR control. Existing knockoff statistics\nfrequently employ linear models to assess relationships between features and\nthe response, but the linearity assumption is often violated in real world\napplications. This may result in poor power to detect truly prognostic\nvariables. We introduce a knockoff statistic based on the conditional\nprediction function (CPF), which can pair with state-of-art machine learning\npredictive models, such as deep neural networks. The CPF statistics can capture\nthe nonlinear relationships between predictors and outcomes while also\naccounting for correlation between features. We illustrate the capability of\nthe CPF statistics to provide superior power over common knockoff statistics\nwith continuous, categorical, and survival outcomes using repeated simulations.\nKnockoff filtering with the CPF statistics is demonstrated using (1) a\nresidential building dataset to select predictors for the actual sales prices\nand (2) the TCGA dataset to select genes that are correlated with disease\nstaging in lung cancer patients."}, "http://arxiv.org/abs/2310.04924": {"title": "Markov Chain Monte Carlo Significance Tests", "link": "http://arxiv.org/abs/2310.04924", "description": "Markov chain Monte Carlo significance tests were first introduced by Besag\nand Clifford in [4]. These methods produce statistical valid p-values in\nproblems where sampling from the null hypotheses is intractable. We give an\noverview of the methods of Besag and Clifford and some recent developments. A\nrange of examples and applications are discussed."}, "http://arxiv.org/abs/2310.04934": {"title": "UBSea: A Unified Community Detection Framework", "link": "http://arxiv.org/abs/2310.04934", "description": "Detecting communities in networks and graphs is an important task across many\ndisciplines such as statistics, social science and engineering. There are\ngenerally three different kinds of mixing patterns for the case of two\ncommunities: assortative mixing, disassortative mixing and core-periphery\nstructure. Modularity optimization is a classical way for fitting network\nmodels with communities. However, it can only deal with assortative mixing and\ndisassortative mixing when the mixing pattern is known and fails to discover\nthe core-periphery structure. In this paper, we extend modularity in a\nstrategic way and propose a new framework based on Unified Bigroups Standadized\nEdge-count Analysis (UBSea). It can address all the formerly mentioned\ncommunity mixing structures. In addition, this new framework is able to\nautomatically choose the mixing type to fit the networks. Simulation studies\nshow that the new framework has superb performance in a wide range of settings\nunder the stochastic block model and the degree-corrected stochastic block\nmodel. 
We show that the new approach produces consistent estimate of the\ncommunities under a suitable signal-to-noise-ratio condition, for the case of a\nblock model with two communities, for both undirected and directed networks.\nThe new method is illustrated through applications to several real-world\ndatasets."}, "http://arxiv.org/abs/2310.05049": {"title": "On Estimation of Optimal Dynamic Treatment Regimes with Multiple Treatments for Survival Data-With Application to Colorectal Cancer Study", "link": "http://arxiv.org/abs/2310.05049", "description": "Dynamic treatment regimes (DTR) are sequential decision rules corresponding\nto several stages of intervention. Each rule maps patients' covariates to\noptional treatments. The optimal dynamic treatment regime is the one that\nmaximizes the mean outcome of interest if followed by the overall population.\nMotivated by a clinical study on advanced colorectal cancer with traditional\nChinese medicine, we propose a censored C-learning (CC-learning) method to\nestimate the dynamic treatment regime with multiple treatments using survival\ndata. To address the challenges of multiple stages with right censoring, we\nmodify the backward recursion algorithm in order to adapt to the flexible\nnumber and timing of treatments. For handling the problem of multiple\ntreatments, we propose a framework from the classification perspective by\ntransferring the problem of optimization with multiple treatment comparisons\ninto an example-dependent cost-sensitive classification problem. With\nclassification and regression tree (CART) as the classifier, the CC-learning\nmethod can produce an estimated optimal DTR with good interpretability. We\ntheoretically prove the optimality of our method and numerically evaluate its\nfinite sample performances through simulation. With the proposed method, we\nidentify the interpretable tree treatment regimes at each stage for the\nadvanced colorectal cancer treatment data from Xiyuan Hospital."}, "http://arxiv.org/abs/2310.05151": {"title": "Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions", "link": "http://arxiv.org/abs/2310.05151", "description": "In clinical trials of longitudinal continuous outcomes, reference based\nimputation (RBI) has commonly been applied to handle missing outcome data in\nsettings where the estimand incorporates the effects of intercurrent events,\ne.g. treatment discontinuation. RBI was originally developed in the multiple\nimputation framework, however recently conditional mean imputation (CMI)\ncombined with the jackknife estimator of the standard error was proposed as a\nway to obtain deterministic treatment effect estimates and correct frequentist\ninference. For both multiple and CMI, a mixed model for repeated measures\n(MMRM) is often used for the imputation model, but this can be computationally\nintensive to fit to multiple data sets (e.g. the jackknife samples) and lead to\nconvergence issues with complex MMRM models with many parameters. Therefore, a\nstep-wise approach based on sequential linear regression (SLR) of the outcomes\nat each visit was developed for the imputation model in the multiple imputation\nframework, but similar developments in the CMI framework are lacking. In this\narticle, we fill this gap in the literature by proposing a SLR approach to\nimplement RBI in the CMI framework, and justify its validity using theoretical\nresults and simulations. 
We also illustrate our proposal on a real data\napplication."}, "http://arxiv.org/abs/2310.05398": {"title": "Statistical Inference for Modulation Index in Phase-Amplitude Coupling", "link": "http://arxiv.org/abs/2310.05398", "description": "Phase-amplitude coupling is a phenomenon observed in several neurological\nprocesses, where the phase of one signal modulates the amplitude of another\nsignal with a distinct frequency. The modulation index (MI) is a common\ntechnique used to quantify this interaction by assessing the Kullback-Leibler\ndivergence between a uniform distribution and the empirical conditional\ndistribution of amplitudes with respect to the phases of the observed signals.\nThe uniform distribution is an ideal representation that is expected to appear\nunder the absence of coupling. However, it does not reflect the statistical\nproperties of coupling values caused by random chance. In this paper, we\npropose a statistical framework for evaluating the significance of an observed\nMI value based on a null hypothesis that a MI value can be entirely explained\nby chance. Significance is obtained by comparing the value with a reference\ndistribution derived under the null hypothesis of independence (i.e., no\ncoupling) between signals. We derived a closed-form distribution of this null\nmodel, resulting in a scaled beta distribution. To validate the efficacy of our\nproposed framework, we conducted comprehensive Monte Carlo simulations,\nassessing the significance of MI values under various experimental scenarios,\nincluding amplitude modulation, trains of spikes, and sequences of\nhigh-frequency oscillations. Furthermore, we corroborated the reliability of\nour model by comparing its statistical significance thresholds with reported\nvalues from other research studies conducted under different experimental\nsettings. Our method offers several advantages such as meta-analysis\nreliability, simplicity and computational efficiency, as it provides p-values\nand significance levels without resorting to generating surrogate data through\nsampling procedures."}, "http://arxiv.org/abs/2310.05526": {"title": "Projecting infinite time series graphs to finite marginal graphs using number theory", "link": "http://arxiv.org/abs/2310.05526", "description": "In recent years, a growing number of method and application works have\nadapted and applied the causal-graphical-model framework to time series data.\nMany of these works employ time-resolved causal graphs that extend infinitely\ninto the past and future and whose edges are repetitive in time, thereby\nreflecting the assumption of stationary causal relationships. However, most\nresults and algorithms from the causal-graphical-model framework are not\ndesigned for infinite graphs. In this work, we develop a method for projecting\ninfinite time series graphs with repetitive edges to marginal graphical models\non a finite time window. These finite marginal graphs provide the answers to\n$m$-separation queries with respect to the infinite graph, a task that was\npreviously unresolved. Moreover, we argue that these marginal graphs are useful\nfor causal discovery and causal effect estimation in time series, effectively\nenabling to apply results developed for finite graphs to the infinite graphs.\nThe projection procedure relies on finding common ancestors in the\nto-be-projected graph and is, by itself, not new. 
However, the projection\nprocedure has not yet been algorithmically implemented for time series graphs\nsince in these infinite graphs there can be infinite sets of paths that might\ngive rise to common ancestors. We solve the search over these possibly infinite\nsets of paths by an intriguing combination of path-finding techniques for\nfinite directed graphs and solution theory for linear Diophantine equations. By\nproviding an algorithm that carries out the projection, our paper makes an\nimportant step towards a theoretically-grounded and method-agnostic\ngeneralization of a range of causal inference methods and results to time\nseries."}, "http://arxiv.org/abs/2310.05539": {"title": "Testing High-Dimensional Mediation Effect with Arbitrary Exposure-Mediator Coefficients", "link": "http://arxiv.org/abs/2310.05539", "description": "In response to the unique challenge created by high-dimensional mediators in\nmediation analysis, this paper presents a novel procedure for testing the\nnullity of the mediation effect in the presence of high-dimensional mediators.\nThe procedure incorporates two distinct features. Firstly, the test remains\nvalid under all cases of the composite null hypothesis, including the\nchallenging scenario where both exposure-mediator and mediator-outcome\ncoefficients are zero. Secondly, it does not impose structural assumptions on\nthe exposure-mediator coefficients, thereby allowing for an arbitrarily strong\nexposure-mediator relationship. To the best of our knowledge, the proposed test\nis the first of its kind to provably possess these two features in\nhigh-dimensional mediation analysis. The validity and consistency of the\nproposed test are established, and its numerical performance is showcased\nthrough simulation studies. The application of the proposed test is\ndemonstrated by examining the mediation effect of DNA methylation between\nsmoking status and lung cancer development."}, "http://arxiv.org/abs/2310.05548": {"title": "Cokrig-and-Regress for Spatially Misaligned Environmental Data", "link": "http://arxiv.org/abs/2310.05548", "description": "Spatially misaligned data, where the response and covariates are observed at\ndifferent spatial locations, commonly arise in many environmental studies. Much\nof the statistical literature on handling spatially misaligned data has been\ndevoted to the case of a single covariate and a linear relationship between the\nresponse and this covariate. Motivated by spatially misaligned data collected\non air pollution and weather in China, we propose a cokrig-and-regress (CNR)\nmethod to estimate spatial regression models involving multiple covariates and\npotentially non-linear associations. The CNR estimator is constructed by\nreplacing the unobserved covariates (at the response locations) by their\ncokriging predictor derived from the observed but misaligned covariates under a\nmultivariate Gaussian assumption, where a generalized Kronecker product\ncovariance is used to account for spatial correlations within and between\ncovariates. A parametric bootstrap approach is employed to bias-correct the CNR\nestimates of the spatial covariance parameters and for uncertainty\nquantification. Simulation studies demonstrate that CNR outperforms several\nexisting methods for handling spatially misaligned data, such as\nnearest-neighbor interpolation. 
Applying CNR to the spatially misaligned air\npollution and weather data in China reveals a number of non-linear\nrelationships between PM$_{2.5}$ concentration and several meteorological\ncovariates."}, "http://arxiv.org/abs/2310.05622": {"title": "A neutral comparison of statistical methods for time-to-event analyses under non-proportional hazards", "link": "http://arxiv.org/abs/2310.05622", "description": "While well-established methods for time-to-event data are available when the\nproportional hazards assumption holds, there is no consensus on the best\ninferential approach under non-proportional hazards (NPH). However, a wide\nrange of parametric and non-parametric methods for testing and estimation in\nthis scenario have been proposed. To provide recommendations on the statistical\nanalysis of clinical trials where non proportional hazards are expected, we\nconducted a comprehensive simulation study under different scenarios of\nnon-proportional hazards, including delayed onset of treatment effect, crossing\nhazard curves, subgroups with different treatment effect and changing hazards\nafter disease progression. We assessed type I error rate control, power and\nconfidence interval coverage, where applicable, for a wide range of methods\nincluding weighted log-rank tests, the MaxCombo test, summary measures such as\nthe restricted mean survival time (RMST), average hazard ratios, and milestone\nsurvival probabilities as well as accelerated failure time regression models.\nWe found a trade-off between interpretability and power when choosing an\nanalysis strategy under NPH scenarios. While analysis methods based on weighted\nlogrank tests typically were favorable in terms of power, they do not provide\nan easily interpretable treatment effect estimate. Also, depending on the\nweight function, they test a narrow null hypothesis of equal hazard functions\nand rejection of this null hypothesis may not allow for a direct conclusion of\ntreatment benefit in terms of the survival function. In contrast,\nnon-parametric procedures based on well interpretable measures as the RMST\ndifference had lower power in most scenarios. Model based methods based on\nspecific survival distributions had larger power, however often gave biased\nestimates and lower than nominal confidence interval coverage."}, "http://arxiv.org/abs/2310.05646": {"title": "Transfer learning for piecewise-constant mean estimation: Optimality, $\\ell_1$- and $\\ell_0$-penalisation", "link": "http://arxiv.org/abs/2310.05646", "description": "We study transfer learning in the context of estimating piecewise-constant\nsignals when source data, which may be relevant but disparate, are available in\naddition to the target data. We initially investigate transfer learning\nestimators that respectively employ $\\ell_1$- and $\\ell_0$-penalties for\nunisource data scenarios and then generalise these estimators to accommodate\nmultisource data. To further reduce estimation errors, especially in scenarios\nwhere some sources significantly differ from the target, we introduce an\ninformative source selection algorithm. We then examine these estimators with\nmultisource selection and establish their minimax optimality under specific\nregularity conditions. It is worth emphasising that, unlike the prevalent\nnarrative in the transfer learning literature that the performance is enhanced\nthrough large source sample sizes, our approaches leverage higher observation\nfrequencies and accommodate diverse frequencies across multiple sources. 
Our\ntheoretical findings are empirically validated through extensive numerical\nexperiments, with the code available online, see\nhttps://github.com/chrisfanwang/transferlearning"}, "http://arxiv.org/abs/2310.05685": {"title": "Post-Selection Inference for Sparse Estimation", "link": "http://arxiv.org/abs/2310.05685", "description": "When the model is not known and parameter testing or interval estimation is\nconducted after model selection, it is necessary to consider selective\ninference. This paper discusses this issue in the context of sparse estimation.\nFirstly, we describe selective inference related to Lasso as per \\cite{lee},\nand then present polyhedra and truncated distributions when applying it to\nmethods such as Forward Stepwise and LARS. Lastly, we discuss the Significance\nTest for Lasso by \\cite{significant} and the Spacing Test for LARS by\n\\cite{ryan_exact}. This paper serves as a review article.\n\nKeywords: post-selective inference, polyhedron, LARS, lasso, forward\nstepwise, significance test, spacing test."}, "http://arxiv.org/abs/2310.05921": {"title": "Conformal Decision Theory: Safe Autonomous Decisions from Imperfect Predictions", "link": "http://arxiv.org/abs/2310.05921", "description": "We introduce Conformal Decision Theory, a framework for producing safe\nautonomous decisions despite imperfect machine learning predictions. Examples\nof such decisions are ubiquitous, from robot planning algorithms that rely on\npedestrian predictions, to calibrating autonomous manufacturing to exhibit high\nthroughput and low error, to the choice of trusting a nominal policy versus\nswitching to a safe backup policy at run-time. The decisions produced by our\nalgorithms are safe in the sense that they come with provable statistical\nguarantees of having low risk without any assumptions on the world model\nwhatsoever; the observations need not be I.I.D. and can even be adversarial.\nThe theory extends results from conformal prediction to calibrate decisions\ndirectly, without requiring the construction of prediction sets. Experiments\ndemonstrate the utility of our approach in robot motion planning around humans,\nautomated stock trading, and robot manufacturin"}, "http://arxiv.org/abs/2101.06950": {"title": "Learning and scoring Gaussian latent variable causal models with unknown additive interventions", "link": "http://arxiv.org/abs/2101.06950", "description": "With observational data alone, causal structure learning is a challenging\nproblem. The task becomes easier when having access to data collected from\nperturbations of the underlying system, even when the nature of these is\nunknown. Existing methods either do not allow for the presence of latent\nvariables or assume that these remain unperturbed. However, these assumptions\nare hard to justify if the nature of the perturbations is unknown. We provide\nresults that enable scoring causal structures in the setting with additive, but\nunknown interventions. Specifically, we propose a maximum-likelihood estimator\nin a structural equation model that exploits system-wide invariances to output\nan equivalence class of causal structures from perturbation data. Furthermore,\nunder certain structural assumptions on the population model, we provide a\nsimple graphical characterization of all the DAGs in the interventional\nequivalence class. 
We illustrate the utility of our framework on synthetic data\nas well as real data involving California reservoirs and protein expressions.\nThe software implementation is available as the Python package \\emph{utlvce}."}, "http://arxiv.org/abs/2107.14151": {"title": "Modern Non-Linear Function-on-Function Regression", "link": "http://arxiv.org/abs/2107.14151", "description": "We introduce a new class of non-linear function-on-function regression models\nfor functional data using neural networks. We propose a framework using a\nhidden layer consisting of continuous neurons, called a continuous hidden\nlayer, for functional response modeling and give two model fitting strategies,\nFunctional Direct Neural Network (FDNN) and Functional Basis Neural Network\n(FBNN). Both are designed explicitly to exploit the structure inherent in\nfunctional data and capture the complex relations existing between the\nfunctional predictors and the functional response. We fit these models by\nderiving functional gradients and implement regularization techniques for more\nparsimonious results. We demonstrate the power and flexibility of our proposed\nmethod in handling complex functional models through extensive simulation\nstudies as well as real data examples."}, "http://arxiv.org/abs/2112.00832": {"title": "On the mixed-model analysis of covariance in cluster-randomized trials", "link": "http://arxiv.org/abs/2112.00832", "description": "In the analyses of cluster-randomized trials, mixed-model analysis of\ncovariance (ANCOVA) is a standard approach for covariate adjustment and\nhandling within-cluster correlations. However, when the normality, linearity,\nor the random-intercept assumption is violated, the validity and efficiency of\nthe mixed-model ANCOVA estimators for estimating the average treatment effect\nremain unclear. Under the potential outcomes framework, we prove that the\nmixed-model ANCOVA estimators for the average treatment effect are consistent\nand asymptotically normal under arbitrary misspecification of its working\nmodel. If the probability of receiving treatment is 0.5 for each cluster, we\nfurther show that the model-based variance estimator under mixed-model ANCOVA1\n(ANCOVA without treatment-covariate interactions) remains consistent,\nclarifying that the confidence interval given by standard software is\nasymptotically valid even under model misspecification. Beyond robustness, we\ndiscuss several insights on precision among classical methods for analyzing\ncluster-randomized trials, including the mixed-model ANCOVA, individual-level\nANCOVA, and cluster-level ANCOVA estimators. These insights may inform the\nchoice of methods in practice. Our analytical results and insights are\nillustrated via simulation studies and analyses of three cluster-randomized\ntrials."}, "http://arxiv.org/abs/2201.10770": {"title": "Confidence intervals for the Cox model test error from cross-validation", "link": "http://arxiv.org/abs/2201.10770", "description": "Cross-validation (CV) is one of the most widely used techniques in\nstatistical learning for estimating the test error of a model, but its behavior\nis not yet fully understood. It has been shown that standard confidence\nintervals for test error using estimates from CV may have coverage below\nnominal levels. This phenomenon occurs because each sample is used in both the\ntraining and testing procedures during CV and as a result, the CV estimates of\nthe errors become correlated. 
Without accounting for this correlation, the\nestimate of the variance is smaller than it should be. One way to mitigate this\nissue is by estimating the mean squared error of the prediction error instead\nusing nested CV. This approach has been shown to achieve superior coverage\ncompared to intervals derived from standard CV. In this work, we generalize the\nnested CV idea to the Cox proportional hazards model and explore various\nchoices of test error for this setting."}, "http://arxiv.org/abs/2202.08419": {"title": "High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2202.08419", "description": "In this paper, we develop a novel high-dimensional time-varying coefficient\nestimation method, based on high-dimensional Ito diffusion processes. To\naccount for high-dimensional time-varying coefficients, we first estimate local\n(or instantaneous) coefficients using a time-localized Dantzig selection scheme\nunder a sparsity condition, which results in biased local coefficient\nestimators due to the regularization. To handle the bias, we propose a\ndebiasing scheme, which provides well-performing unbiased local coefficient\nestimators. With the unbiased local coefficient estimators, we estimate the\nintegrated coefficient, and to further account for the sparsity of the\ncoefficient process, we apply thresholding schemes. We call this Thresholding\ndEbiased Dantzig (TED). We establish asymptotic properties of the proposed TED\nestimator. In the empirical analysis, we apply the TED procedure to analyzing\nhigh-dimensional factor models using high-frequency data."}, "http://arxiv.org/abs/2206.12525": {"title": "Causality of Functional Longitudinal Data", "link": "http://arxiv.org/abs/2206.12525", "description": "\"Treatment-confounder feedback\" is the central complication to resolve in\nlongitudinal studies, to infer causality. The existing frameworks for\nidentifying causal effects for longitudinal studies with discrete repeated\nmeasures hinge heavily on assuming that time advances in discrete time steps or\ntreatment changes as a jumping process, rendering the number of \"feedbacks\"\nfinite. However, medical studies nowadays with real-time monitoring involve\nfunctional time-varying outcomes, treatment, and confounders, which leads to an\nuncountably infinite number of feedbacks between treatment and confounders.\nTherefore more general and advanced theory is needed. We generalize the\ndefinition of causal effects under user-specified stochastic treatment regimes\nto longitudinal studies with continuous monitoring and develop an\nidentification framework, allowing right censoring and truncation by death. We\nprovide sufficient identification assumptions including a generalized\nconsistency assumption, a sequential randomization assumption, a positivity\nassumption, and a novel \"intervenable\" assumption designed for the\ncontinuous-time case. Under these assumptions, we propose a g-computation\nprocess and an inverse probability weighting process, which suggest a\ng-computation formula and an inverse probability weighting formula for\nidentification. For practical purposes, we also construct two classes of\npopulation estimating equations to identify these two processes, respectively,\nwhich further suggest a doubly robust identification formula with extra\nrobustness against process misspecification. 
We prove that our framework fully\ngeneralize the existing frameworks and is nonparametric."}, "http://arxiv.org/abs/2209.08139": {"title": "Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm", "link": "http://arxiv.org/abs/2209.08139", "description": "Bayesian variable selection methods are powerful techniques for fitting and\ninferring on sparse high-dimensional linear regression models. However, many\nare computationally intensive or require restrictive prior distributions on\nmodel parameters. In this paper, we proposed a computationally efficient and\npowerful Bayesian approach for sparse high-dimensional linear regression.\nMinimal prior assumptions on the parameters are required through the use of\nplug-in empirical Bayes estimates of hyperparameters. Efficient maximum a\nposteriori (MAP) estimation is completed through a Parameter-Expanded\nExpectation-Conditional-Maximization (PX-ECM) algorithm. The PX-ECM results in\na robust computationally efficient coordinate-wise optimization which -- when\nupdating the coefficient for a particular predictor -- adjusts for the impact\nof other predictor variables. The completion of the E-step uses an approach\nmotivated by the popular two-group approach to multiple testing. The result is\na PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse\nhigh-dimensional linear regression, which can be completed using one-at-a-time\nor all-at-once type optimization. We compare the empirical properties of PROBE\nto comparable approaches with numerous simulation studies and analyses of\ncancer cell drug responses. The proposed approach is implemented in the R\npackage probe."}, "http://arxiv.org/abs/2212.02709": {"title": "SURE-tuned Bridge Regression", "link": "http://arxiv.org/abs/2212.02709", "description": "Consider the {$\\ell_{\\alpha}$} regularized linear regression, also termed\nBridge regression. For $\\alpha\\in (0,1)$, Bridge regression enjoys several\nstatistical properties of interest such as sparsity and near-unbiasedness of\nthe estimates (Fan and Li, 2001). However, the main difficulty lies in the\nnon-convex nature of the penalty for these values of $\\alpha$, which makes an\noptimization procedure challenging and usually it is only possible to find a\nlocal optimum. To address this issue, Polson et al. (2013) took a sampling\nbased fully Bayesian approach to this problem, using the correspondence between\nthe Bridge penalty and a power exponential prior on the regression\ncoefficients. However, their sampling procedure relies on Markov chain Monte\nCarlo (MCMC) techniques, which are inherently sequential and not scalable to\nlarge problem dimensions. Cross validation approaches are similarly\ncomputation-intensive. To this end, our contribution is a novel\n\\emph{non-iterative} method to fit a Bridge regression model. The main\ncontribution lies in an explicit formula for Stein's unbiased risk estimate for\nthe out of sample prediction risk of Bridge regression, which can then be\noptimized to select the desired tuning parameters, allowing us to completely\nbypass MCMC as well as computation-intensive cross validation approaches. Our\nprocedure yields results in a fraction of computational times compared to\niterative schemes, without any appreciable loss in statistical performance. 
An\nR implementation is publicly available online at:\nhttps://github.com/loriaJ/Sure-tuned_BridgeRegression ."}, "http://arxiv.org/abs/2212.03122": {"title": "Robust convex biclustering with a tuning-free method", "link": "http://arxiv.org/abs/2212.03122", "description": "Biclustering is widely used in different kinds of fields including gene\ninformation analysis, text mining, and recommendation system by effectively\ndiscovering the local correlation between samples and features. However, many\nbiclustering algorithms will collapse when facing heavy-tailed data. In this\npaper, we propose a robust version of convex biclustering algorithm with Huber\nloss. Yet, the newly introduced robustification parameter brings an extra\nburden to selecting the optimal parameters. Therefore, we propose a tuning-free\nmethod for automatically selecting the optimal robustification parameter with\nhigh efficiency. The simulation study demonstrates the more fabulous\nperformance of our proposed method than traditional biclustering methods when\nencountering heavy-tailed noise. A real-life biomedical application is also\npresented. The R package RcvxBiclustr is available at\nhttps://github.com/YifanChen3/RcvxBiclustr."}, "http://arxiv.org/abs/2301.09661": {"title": "Estimating marginal treatment effects from observational studies and indirect treatment comparisons: When are standardization-based methods preferable to those based on propensity score weighting?", "link": "http://arxiv.org/abs/2301.09661", "description": "In light of newly developed standardization methods, we evaluate, via\nsimulation study, how propensity score weighting and standardization -based\napproaches compare for obtaining estimates of the marginal odds ratio and the\nmarginal hazard ratio. Specifically, we consider how the two approaches compare\nin two different scenarios: (1) in a single observational study, and (2) in an\nanchored indirect treatment comparison (ITC) of randomized controlled trials.\nWe present the material in such a way so that the matching-adjusted indirect\ncomparison (MAIC) and the (novel) simulated treatment comparison (STC) methods\nin the ITC setting may be viewed as analogous to the propensity score weighting\nand standardization methods in the single observational study setting. Our\nresults suggest that current recommendations for conducting ITCs can be\nimproved and underscore the importance of adjusting for purely prognostic\nfactors."}, "http://arxiv.org/abs/2302.11746": {"title": "Logistic Regression and Classification with non-Euclidean Covariates", "link": "http://arxiv.org/abs/2302.11746", "description": "We introduce a logistic regression model for data pairs consisting of a\nbinary response and a covariate residing in a non-Euclidean metric space\nwithout vector structures. Based on the proposed model we also develop a binary\nclassifier for non-Euclidean objects. We propose a maximum likelihood estimator\nfor the non-Euclidean regression coefficient in the model, and provide upper\nbounds on the estimation error under various metric entropy conditions that\nquantify complexity of the underlying metric space. Matching lower bounds are\nderived for the important metric spaces commonly seen in statistics,\nestablishing optimality of the proposed estimator in such spaces. Similarly, an\nupper bound on the excess risk of the developed classifier is provided for\ngeneral metric spaces. 
A finer upper bound and a matching lower bound, and thus\noptimality of the proposed classifier, are established for Riemannian\nmanifolds. We investigate the numerical performance of the proposed estimator\nand classifier via simulation studies, and illustrate their practical merits\nvia an application to task-related fMRI data."}, "http://arxiv.org/abs/2302.13658": {"title": "Robust High-Dimensional Time-Varying Coefficient Estimation", "link": "http://arxiv.org/abs/2302.13658", "description": "In this paper, we develop a novel high-dimensional coefficient estimation\nprocedure based on high-frequency data. Unlike usual high-dimensional\nregression procedure such as LASSO, we additionally handle the heavy-tailedness\nof high-frequency observations as well as time variations of coefficient\nprocesses. Specifically, we employ Huber loss and truncation scheme to handle\nheavy-tailed observations, while $\\ell_{1}$-regularization is adopted to\novercome the curse of dimensionality. To account for the time-varying\ncoefficient, we estimate local coefficients which are biased due to the\n$\\ell_{1}$-regularization. Thus, when estimating integrated coefficients, we\npropose a debiasing scheme to enjoy the law of large number property and employ\na thresholding scheme to further accommodate the sparsity of the coefficients.\nWe call this Robust thrEsholding Debiased LASSO (RED-LASSO) estimator. We show\nthat the RED-LASSO estimator can achieve a near-optimal convergence rate. In\nthe empirical study, we apply the RED-LASSO procedure to the high-dimensional\nintegrated coefficient estimation using high-frequency trading data."}, "http://arxiv.org/abs/2307.04754": {"title": "Action-State Dependent Dynamic Model Selection", "link": "http://arxiv.org/abs/2307.04754", "description": "A model among many may only be best under certain states of the world.\nSwitching from a model to another can also be costly. Finding a procedure to\ndynamically choose a model in these circumstances requires to solve a complex\nestimation procedure and a dynamic programming problem. A Reinforcement\nlearning algorithm is used to approximate and estimate from the data the\noptimal solution to this dynamic programming problem. The algorithm is shown to\nconsistently estimate the optimal policy that may choose different models based\non a set of covariates. A typical example is the one of switching between\ndifferent portfolio models under rebalancing costs, using macroeconomic\ninformation. Using a set of macroeconomic variables and price data, an\nempirical application to the aforementioned portfolio problem shows superior\nperformance to choosing the best portfolio model with hindsight."}, "http://arxiv.org/abs/2307.14828": {"title": "Identifying regime switches through Bayesian wavelet estimation: evidence from flood detection in the Taquari River Valley", "link": "http://arxiv.org/abs/2307.14828", "description": "Two-component mixture models have proved to be a powerful tool for modeling\nheterogeneity in several cluster analysis contexts. However, most methods based\non these models assume a constant behavior for the mixture weights, which can\nbe restrictive and unsuitable for some applications. In this paper, we relax\nthis assumption and allow the mixture weights to vary according to the index\n(e.g., time) to make the model more adaptive to a broader range of data sets.\nWe propose an efficient MCMC algorithm to jointly estimate both component\nparameters and dynamic weights from their posterior samples. 
We evaluate the\nmethod's performance by running Monte Carlo simulation studies under different\nscenarios for the dynamic weights. In addition, we apply the algorithm to a\ntime series that records the level reached by a river in southern Brazil. The\nTaquari River is a water body whose frequent flood inundations have caused\nconsiderable damage to riverside communities. Implementing a dynamic mixture model\nallows us to properly describe the flood regimes for the areas most affected by\nthese phenomena."}} \ No newline at end of file