Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

Open
ganesh-srinivas opened this issue Jun 13, 2018 · 0 comments

Comments

@ganesh-srinivas
Copy link

tldr

Risk Ratio formula and one boundary condition is implemented incorrectly. A fix is provided below. I can do a Pull Request if you want me to.

Issue

In Section 5.1 (Semantics: Support and Risk Ratio) of the MacroBase paper, risk ratio was defined as:
riskRatio = (a_o / (a_o + a_i) ) / ( b_o / (b_o + b_i) )

When translated using MacroBase source code variables, the risk ratio expression should be:
riskRatio = (exposedOutlierCount / totalExposedCount) / (exposedInlierCount / totalInlierCount)

In the MacroBase GUI Tutorial provided as part of the documentation, the following write-up is provided for risk ratio statistic. Please note that the example in the tutorial provides an incorrect risk ratio value due to the bug I'm reporting.

Ratio Out/In is the proportion of outlier records containing this attribute combination compared to the proportion of inlier records containing this attribute combination (i.e., support in outliers divided by support in inliers). A ratio of 1 means that this pattern appeared equally frequently in inlier and outliers. A ratio of infinity means this pattern was not present in the inliers.

Instead of the above expression, the following is seen (line 46-47).

// all outliers had this pattern
if (unexposedOutlierCount == 0) {
return new RiskRatioResult(Double.POSITIVE_INFINITY);
}
double z = 2.0;
double correction = z*Math.sqrt(
(exposedInlierCount / exposedOutlierCount)/totalExposedCount
+ (unexposedInlierCount / unexposedInlierCount)/totalMinusExposedCount
);
return new RiskRatioResult((exposedOutlierCount / totalExposedCount) /
(unexposedOutlierCount / totalMinusExposedCount), correction);
}

In addition to this bug, a boundary condition is implemented incorrectly (line 35-38). When all outliers have the pattern, risk ratio need not be INFINITY. Risk ratio becomes INFINITY only when none of the inliers have the pattern i.e., when exposedInlierCount == 0.

Fix

These bugs in Risk Ratio computation can be fixed by introducing the following modification for the legacy and lib source code files that compute Risk Ratio.

        if (exposedInlierCount == 0) {
            return  new RiskRatioResult(Double.POSITIVE_INFINITY);
        }

        double z = 2.0;
        double correction = z*Math.sqrt(
                (exposedInlierCount / exposedOutlierCount)/totalExposedCount
                + (unexposedInlierCount / unexposedInlierCount)/totalMinusExposedCount
        );

        return new RiskRatioResult((exposedOutlierCount / totalOutliers) /
                                   (exposedInlierCount / totalInliers), correction);

I also found two tests that need to be modified since they check boundary condition behavior incorrectly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant