Dear team, first of all, thank you for maintaining this useful package. I was trying to run the independence test by Shen et al. (2022), but I got unexpected results: the test rejected independence even though I knew that the two one-dimensional vectors I was testing were independent. The problem only became visible with large samples (roughly 70,000 observations or more).
I have not yet identified the source of the problem in the code, but the script below reproduces it. Please note that I have modified the statistic method in dcorr.py so that it returns both "stat" and "covar".
Reproducing code example:
import numpy as np
import hyppo.independence

# _r_distance_corr (defined elsewhere) follows the R/original-paper implementation.
t = hyppo.independence.Dcorr()
t.is_fast = True
n_samples = [100, 1000, 10000, 50000, 70000, 100000]
for n in n_samples:
    # Box-Muller transform: S1 and S2 are independent standard normals.
    U1 = np.random.rand(n, 1)
    U2 = np.random.rand(n, 1)
    S1 = np.sqrt(-2 * np.log(U1)) * np.cos(2 * np.pi * U2)
    S2 = np.sqrt(-2 * np.log(U1)) * np.sin(2 * np.pi * U2)
    # The modified statistic method returns (stat, covar).
    stat, covar = t.statistic(S1, S2)
    print(f'current n is: {n}')
    print(f'Implementation according to R/original paper (unbiased squared distance covariance): {_r_distance_corr(S1, S2, mode="squared_cov", unbiased=True)}')
    print(f'Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): {covar}')
    print(f'Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): {stat}\n')
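Note that at the larger sample sizes (n = 70000 and n = 100000), the hyppo unbiased squared distance covariance no longer agrees with the reference value, and the hyppo distance correlation jumps above 0.9 even though the data are independent.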
Results
current n is: 100
Implementation according to R/original paper (unbiased squared distance covariance): 0.017224788665771484
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 0.017225187792305974
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.052031725347361675
current n is: 1000
Implementation according to R/original paper (unbiased squared distance covariance): 3.993511199951172e-05
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 4.020496229051318e-05
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 9.702530109483742e-05
current n is: 10000
Implementation according to R/original paper (unbiased squared distance covariance): 8.52346420288086e-05
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 8.585070210065382e-05
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.000211508104872705
current n is: 50000
Implementation according to R/original paper (unbiased squared distance covariance): -7.987022399902344e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): -8.067143002499222e-06
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): -1.9942861479842804e-05
current n is: 70000
Implementation according to R/original paper (unbiased squared distance covariance): -6.318092346191406e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 4.23422066658965
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.9132199516986869
current n is: 100000
Implementation according to R/original paper (unbiased squared distance covariance): 1.430511474609375e-06
Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): 15.134514035440011
Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): 0.9742237257727073
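For small n, values like the ones above can be cross-checked against a direct O(n²) computation. The following is a minimal sketch, assuming the U-centered unbiased estimator of Székely and Rizzo (2014) and one-dimensional inputs. The names `unbiased_dcov_sq` and `bias_corrected_dcorr` are illustrative only; they are neither hyppo's API nor the `_r_distance_corr` helper used in the script.

```python
import numpy as np


def _u_center(d):
    """U-center a symmetric pairwise distance matrix (diagonal set to 0)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True)
    col = d.sum(axis=0, keepdims=True)
    total = d.sum()
    u = d - row / (n - 2) - col / (n - 2) + total / ((n - 1) * (n - 2))
    np.fill_diagonal(u, 0.0)
    return u


def unbiased_dcov_sq(x, y):
    """Unbiased squared distance covariance for 1-D samples (O(n^2) memory)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    n = x.size
    if y.size != n:
        raise ValueError("x and y must have the same number of samples")
    if n < 4:
        raise ValueError("the unbiased estimator needs at least 4 samples")
    a = _u_center(np.abs(x[:, None] - x[None, :]))
    b = _u_center(np.abs(y[:, None] - y[None, :]))
    return (a * b).sum() / (n * (n - 3))


def bias_corrected_dcorr(x, y):
    """Bias-corrected squared distance correlation built from the estimator above."""
    denom = np.sqrt(unbiased_dcov_sq(x, x) * unbiased_dcov_sq(y, y))
    return unbiased_dcov_sq(x, y) / denom


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1 = rng.standard_normal(1000)
    s2 = rng.standard_normal(1000)
    print(unbiased_dcov_sq(s1, s2), bias_corrected_dcorr(s1, s2))
```

Because this builds full n × n matrices, it is only practical up to a few thousand samples; it is meant as a sanity check on small inputs, not as a replacement for the fast algorithm.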
Version information
OS: Any
Python Version: Any
Package Version: 0.3.2 and 0.4.0
Hello @loremarchi, I reviewed your code and output. The discrepancy may be due to implementation differences or statistical power. To resolve it, you could use another library or try this code:
import numpy as np
import hyppo.independence
from scipy.spatial.distance import squareform, pdist

def _r_distance_corr(X, Y, mode="squared_cov", unbiased=True):
    X = X.flatten()
    Y = Y.flatten()
    n = len(X)
    # Full n x n pairwise distance matrices.
    a = squareform(pdist(X[:, np.newaxis]))
    b = squareform(pdist(Y[:, np.newaxis]))
    # Double-center each distance matrix.
    A = a - a.mean(axis=0)[np.newaxis, :] - a.mean(axis=1)[:, np.newaxis] + a.mean()
    B = b - b.mean(axis=0)[np.newaxis, :] - b.mean(axis=1)[:, np.newaxis] + b.mean()
    if mode == "squared_cov":
        cov = (A * B).mean()
        if unbiased:
            # Rescale the double-centered covariance by the sample variances.
            var_X = np.var(X, ddof=1)
            var_Y = np.var(Y, ddof=1)
            return cov / np.sqrt(var_X * var_Y)
        else:
            return cov
    elif mode == "cov":
        cov = (A * B).mean()
        return cov
    elif mode == "corr":
        corr = (A * B).mean() / np.sqrt((A ** 2).mean() * (B ** 2).mean())
        return corr
    raise ValueError(f"unknown mode: {mode!r}")

t = hyppo.independence.Dcorr()
t.is_fast = True
n_samples = [100, 1000, 10000, 50000, 70000, 100000]
for n in n_samples:
    U1 = np.random.rand(n, 1)
    U2 = np.random.rand(n, 1)
    S1 = np.sqrt(-2 * np.log(U1)) * np.cos(2 * np.pi * U2)
    S2 = np.sqrt(-2 * np.log(U1)) * np.sin(2 * np.pi * U2)
    # Assumes the modified statistic method that returns (stat, covar).
    stat, covar = t.statistic(S1, S2)
    print(f'current n is: {n}')
    print(f'Implementation according to R/original paper (unbiased squared distance covariance): {_r_distance_corr(S1, S2, mode="squared_cov", unbiased=True)}')
    print(f'Implementation according to hyppo (unbiased squared distance covariance, covar in dcorr.py): {covar}')
    print(f'Implementation according to hyppo (unbiased distance correlation, stat in dcorr.py): {stat}\n')
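One practical caveat with a full-matrix approach: each n × n float64 matrix at n = 100000 takes about 80 GB, so it will not fit in memory on most machines. The sketch below shows one possible workaround, assuming one-dimensional inputs and the double-centered (biased) squared distance covariance, i.e. the `(A * B).mean()` quantity. It processes the distance matrix in row blocks so that only `chunk × n` entries exist at a time. The function name `dcov_sq_biased_chunked` and the `chunk` parameter are hypothetical, not part of hyppo or SciPy.

```python
import numpy as np


def dcov_sq_biased_chunked(x, y, chunk=1024):
    """Double-centered squared distance covariance for 1-D data, in row blocks."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    n = x.size
    if y.size != n:
        raise ValueError("x and y must have the same number of samples")

    # Pass 1: row means of each distance matrix. The matrices are symmetric,
    # so the column means are the same vectors.
    row_a = np.empty(n)
    row_b = np.empty(n)
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        row_a[s:e] = np.abs(x[s:e, None] - x[None, :]).mean(axis=1)
        row_b[s:e] = np.abs(y[s:e, None] - y[None, :]).mean(axis=1)
    grand_a = row_a.mean()
    grand_b = row_b.mean()

    # Pass 2: accumulate the sum of A_ij * B_ij one row block at a time.
    total = 0.0
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        a = np.abs(x[s:e, None] - x[None, :])
        b = np.abs(y[s:e, None] - y[None, :])
        A = a - row_a[s:e, None] - row_a[None, :] + grand_a
        B = b - row_b[s:e, None] - row_b[None, :] + grand_b
        total += (A * B).sum()
    return total / n**2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s1 = rng.standard_normal(5000)
    s2 = rng.standard_normal(5000)
    print(dcov_sq_biased_chunked(s1, s2))
```

The running time is still O(n²), so this is slow at n = 100000, but peak memory stays at roughly `chunk × n` floats per temporary array.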