DiscrimTwoSample output depends on input order #422

Open
ameliecr opened this issue Nov 11, 2024 · 0 comments
Labels
bug Something isn't working

When I run DiscrimTwoSample.test, the results differ depending on the order in which I pass x1 and x2; see the code and output below.
After looking into the code, I think the problem lies in the removal of isolates. When two matrices are provided, the isolates seem to be removed only from the first one. If I remove the isolates myself before passing the matrices to DiscrimTwoSample.test, I get the expected results.
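
A minimal sketch of that workaround, using NumPy only. It assumes that an "isolate" is a sample whose label occurs exactly once in the label vector; that is my reading of the library code, not something I have confirmed. The helper name `drop_isolates` is hypothetical and not part of hyppo. It applies the same row/column selection to both distance matrices so that they stay aligned with the labels.

```python
import numpy as np


def drop_isolates(dist_1, dist_2, labels):
    """Drop samples whose label occurs only once, from both matrices alike.

    Hypothetical helper (not part of hyppo). Assumes both matrices are square
    and share the sample order given by `labels`.
    """
    labels = np.asarray(labels)
    # counts[inverse] gives, for every sample, how often its label occurs
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    keep = np.flatnonzero(counts[inverse] > 1)
    # Select the same rows and columns in both matrices
    return dist_1[np.ix_(keep, keep)], dist_2[np.ix_(keep, keep)], labels[keep]
```

With this, `filtered_distances_1`, `filtered_distances_2` and `common_subids` could be passed through `drop_isolates` before calling `DiscrimTwoSample(...).test(...)`, so the result no longer relies on the library's own isolate removal.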

Reproducing code example:

import pandas as pd
import numpy as np
import re
from hyppo.discrim import DiscrimTwoSample

def get_discrim_two_sample(dice_1: str, dice_2: str):
    dices_1 = pd.read_csv(dice_1, index_col=0, na_values=[""]).values
    distances_1 = 1 - dices_1
    subject_ids_1 = pd.read_csv(dice_1, nrows=0).columns.values[1:]
    subject_ids_1 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_1]
    )

    
    dices_2 = pd.read_csv(dice_2, index_col=0, na_values=[""]).values
    distances_2 = 1 - dices_2
    subject_ids_2 = pd.read_csv(dice_2, nrows=0).columns.values[1:]
    subject_ids_2 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_2]
    )

    # Remove rows and columns from the distance matrix that only contain NaNs
    # This will exclude runs that the bundle of interest couldn't be reconstructed for
    rows_to_keep_1 = ~np.isnan(distances_1).all(axis=1)
    distances_1 = distances_1[rows_to_keep_1]
    distances_1 = distances_1[:, rows_to_keep_1]
    subject_ids_1 = subject_ids_1[rows_to_keep_1]
    rows_to_keep_2 = ~np.isnan(distances_2).all(axis=1)
    distances_2 = distances_2[rows_to_keep_2]
    distances_2 = distances_2[:, rows_to_keep_2]
    subject_ids_2 = subject_ids_2[rows_to_keep_2]

    # Keep only the subID-run combos that are present in both distance matrices.
    # np.intersect1d returns the sorted common IDs together with their positions
    # in each input, so the lookup stays correct even if the IDs are unsorted
    # (np.searchsorted would require sorted input).
    common_subids, indices1, indices2 = np.intersect1d(
        subject_ids_1, subject_ids_2, return_indices=True
    )

    # Filter both matrices to the common combos, in the same order
    filtered_distances_1 = distances_1[np.ix_(indices1, indices1)]
    filtered_distances_2 = distances_2[np.ix_(indices2, indices2)]

    # Remove the run suffix from the subject IDs so that they can be converted to float
    common_subids = np.array(
        [re.sub(r"_run-\d+", "", col) for col in common_subids]
    )

    # Run the test in both input orders so the two outputs can be compared
    output_12 = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_1, filtered_distances_2, common_subids, workers=1
    )
    print(output_12)
    output_21 = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_2, filtered_distances_1, common_subids, workers=1
    )
    print(output_21)

    return output_12, output_21


dice1 = "/Users/amelier/Code/dice_GQI/ProjectionBrainstemDentatorubrothalamicTractlr.csv"
dice2 = "/Users/amelier/Code/dice_SS3T/ProjectionBrainstemDentatorubrothalamicTractlr.csv"

get_discrim_two_sample(dice1, dice2)

Output:

DiscrimTwoSampleTestOutput(d1=0.6447122262572907, d2=0.5462221671985621, pvalue=0.001)
DiscrimTwoSampleTestOutput(d1=0.6224689116320018, d2=0.5630685778217968, pvalue=2.002002002002002e-06)
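
For reference, a small sketch of the check I would expect to pass. It assumes the intended behavior is that d1 and d2 simply trade places when x1 and x2 are swapped. The function name `is_swapped` is hypothetical, and the two stand-in objects only carry the d1 and d2 values reported above. The p-value is left out of the comparison because it appears to be estimated by permutation and may vary between runs.

```python
import math
from types import SimpleNamespace


def is_swapped(out_12, out_21, tol=1e-9):
    """Return True if d1 and d2 trade places between the two input orders."""
    return (
        math.isclose(out_12.d1, out_21.d2, abs_tol=tol)
        and math.isclose(out_12.d2, out_21.d1, abs_tol=tol)
    )


# Stand-ins holding the d1/d2 values from the two printed outputs
reported_12 = SimpleNamespace(d1=0.6447122262572907, d2=0.5462221671985621)
reported_21 = SimpleNamespace(d1=0.6224689116320018, d2=0.5630685778217968)

print(is_swapped(reported_12, reported_21))  # prints False for the values above
```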

Version information

  • OS: macOS
  • Python Version 3.12.5
  • Package Version 0.5.1
@ameliecr ameliecr added the bug Something isn't working label Nov 11, 2024