DiscrimTwoSample output depends on input order #422

ameliecr · 2024-11-11T22:20:00Z

When I run DiscrimTwoSample.test, I get different results depending on the order in which I pass x1 and x2. Please see the code and output below.
After looking into the code, I think the problem lies in the removal of isolates. When two matrices are provided, isolates appear to be removed only from the first matrix. If I remove the isolates myself before passing the matrices to DiscrimTwoSample.test, I get the expected results.
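
The workaround mentioned above can be sketched roughly as follows. This is only an illustration: `drop_isolates` is a hypothetical helper, not part of hyppo. It assumes that an isolate is a subject whose label occurs exactly once in y, and that both distance matrices use the same row/column order as the labels.

```python
import numpy as np


def drop_isolates(dist_1, dist_2, labels):
    """Remove subjects with a single measurement from BOTH distance matrices.

    Hypothetical helper for illustration; assumes `dist_1` and `dist_2` are
    square (n x n) arrays ordered like `labels` (length n).
    """
    labels = np.asarray(labels)
    # Count how often each label occurs and map the counts back to each row
    _, inverse, counts = np.unique(
        labels, return_inverse=True, return_counts=True
    )
    keep = counts[inverse] > 1
    idx = np.flatnonzero(keep)
    # Apply the same row/column selection to both matrices and to the labels
    return dist_1[np.ix_(idx, idx)], dist_2[np.ix_(idx, idx)], labels[idx]
```

Applying such a helper to both matrices and the labels before calling DiscrimTwoSample.test means the same rows and columns are dropped from each input, regardless of the order in which they are passed.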

Reproducing code example:

import pandas as pd
import numpy as np
import re
from hyppo.discrim import DiscrimTwoSample

def get_discrim_two_sample(dice_1: str, dice_2: str):
    dices_1 = pd.read_csv(dice_1, index_col=0, na_values=[""]).values
    distances_1 = 1 - dices_1
    subject_ids_1 = pd.read_csv(dice_1, nrows=0).columns.values[1:]
    subject_ids_1 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_1]
    )

    
    dices_2 = pd.read_csv(dice_2, index_col=0, na_values=[""]).values
    distances_2 = 1 - dices_2
    subject_ids_2 = pd.read_csv(dice_2, nrows=0).columns.values[1:]
    subject_ids_2 = np.array(
        [re.sub(r"sub-", "", col) for col in subject_ids_2]
    )

    # Remove rows and columns of the distance matrix that contain only NaNs
    # This excludes runs for which the bundle of interest could not be reconstructed
    rows_to_keep_1 = ~np.isnan(distances_1).all(axis=1)
    distances_1 = distances_1[rows_to_keep_1]
    distances_1 = distances_1[:, rows_to_keep_1]
    subject_ids_1 = subject_ids_1[rows_to_keep_1]
    rows_to_keep_2 = ~np.isnan(distances_2).all(axis=1)
    distances_2 = distances_2[rows_to_keep_2]
    distances_2 = distances_2[:, rows_to_keep_2]
    subject_ids_2 = subject_ids_2[rows_to_keep_2]

    # Remove all subID-run combos that are not present in both distance matrices
    # Step 1: Find and sort the common subject-run combinations for stable indexing
    common_subids = np.intersect1d(subject_ids_1, subject_ids_2)
    common_subids.sort()  # Sort to ensure consistency in ordering

    # Step 2: Find indices of common combinations in both vectors using sorted common_subids
    # Note: np.searchsorted assumes subject_ids_1 and subject_ids_2 are already sorted
    indices1 = np.searchsorted(subject_ids_1, common_subids)
    indices2 = np.searchsorted(subject_ids_2, common_subids)

    # Step 3: Filter matrices accordingly, ensuring consistent ordering
    filtered_distances_1 = distances_1[np.ix_(indices1, indices1)]
    filtered_distances_2 = distances_2[np.ix_(indices2, indices2)]

    # Remove the run suffix from the subject IDs so that they can be converted to float
    common_subids = np.array(
        [re.sub(r"_run-\d+", "", col) for col in common_subids]
    )

    # Run the test with (x1, x2), then again with the two inputs swapped
    two_sample_output = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_1, filtered_distances_2, common_subids, workers=1
    )
    print(two_sample_output)
    two_sample_output = DiscrimTwoSample(is_dist=True, remove_isolates=True).test(
        filtered_distances_2, filtered_distances_1, common_subids, workers=1
    )
    print(two_sample_output)

    return two_sample_output


dice1 = '/Users/amelier/Code/dice_GQI/ProjectionBrainstemDentatorubrothalamicTractlr.csv'
dice2 = "/Users/amelier/Code/dice_SS3T/ProjectionBrainstemDentatorubrothalamicTractlr.csv"

get_discrim_two_sample(dice1, dice2)

Output:

DiscrimTwoSampleTestOutput(d1=0.6447122262572907, d2=0.5462221671985621, pvalue=0.001)
DiscrimTwoSampleTestOutput(d1=0.6224689116320018, d2=0.5630685778217968, pvalue=2.002002002002002e-06)
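
The expectation behind this report is that swapping x1 and x2 should only swap d1 and d2. A minimal sketch of that check is given here; `check_order_symmetry` and `run_test` are hypothetical names, not part of hyppo, and `run_test` is assumed to be any callable that takes (x1, x2, y) and returns an object with `d1` and `d2` attributes, as the output above does.

```python
import math
from typing import Any, Callable


def check_order_symmetry(
    run_test: Callable[[Any, Any, Any], Any], x1, x2, y, tol: float = 1e-12
) -> bool:
    """Return True if swapping the inputs simply swaps d1 and d2."""
    forward = run_test(x1, x2, y)
    backward = run_test(x2, x1, y)
    # d1 of the forward call should equal d2 of the swapped call, and vice versa
    return math.isclose(forward.d1, backward.d2, abs_tol=tol) and math.isclose(
        forward.d2, backward.d1, abs_tol=tol
    )
```

With the numbers printed above, such a check would fail: the first call reports d1=0.6447, while the swapped call reports d2=0.5631.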

Version information

OS: macOS
Python version: 3.12.5
hyppo version: 0.5.1

ameliecr added the bug label on Nov 11, 2024