Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

ganesh-srinivas · 2018-06-13T08:38:54Z

tldr

Risk Ratio formula and one boundary condition is implemented incorrectly. A fix is provided below. I can do a Pull Request if you want me to.

Issue

In Section 5.1 (Semantics: Support and Risk Ratio) of the MacroBase paper, risk ratio was defined as:
riskRatio = (a_o / (a_o + a_i) ) / ( b_o / (b_o + b_i) )

When translated using MacroBase source code variables, the risk ratio expression should be:
riskRatio = (exposedOutlierCount / totalExposedCount) / (exposedInlierCount / totalInlierCount)

In the MacroBase GUI Tutorial provided as part of the documentation, the following write-up is provided for risk ratio statistic. Please note that the example in the tutorial provides an incorrect risk ratio value due to the bug I'm reporting.

Ratio Out/In is the proportion of outlier records containing this attribute combination compared to the proportion of inlier records containing this attribute combination (i.e., support in outliers divided by support in inliers). A ratio of 1 means that this pattern appeared equally frequently in inlier and outliers. A ratio of infinity means this pattern was not present in the inliers.

Instead of the above expression, the following is seen (line 46-47).

macrobase/legacy/src/main/java/macrobase/analysis/summary/itemset/RiskRatio.java

Lines 35 to 49 in f41fc5c

    
               // all outliers had this pattern 
        
               if (unexposedOutlierCount == 0) { 
        
                   return  new RiskRatioResult(Double.POSITIVE_INFINITY); 
        
               } 
        
               double z = 2.0; 
        
               double correction = z*Math.sqrt( 
        
                       (exposedInlierCount / exposedOutlierCount)/totalExposedCount 
        
                       + (unexposedInlierCount / unexposedInlierCount)/totalMinusExposedCount 
        
               ); 
        
               return new RiskRatioResult((exposedOutlierCount / totalExposedCount) / 
        
                      (unexposedOutlierCount / totalMinusExposedCount), correction); 
        
           }

In addition to this bug, a boundary condition is implemented incorrectly (line 35-38). When all outliers have the pattern, risk ratio need not be INFINITY. Risk ratio becomes INFINITY only when none of the inliers have the pattern i.e., when exposedInlierCount == 0.

Fix

These bugs in Risk Ratio computation can be fixed by introducing the following modification for the legacy and lib source code files that compute Risk Ratio.

        if (exposedInlierCount == 0) {
            return  new RiskRatioResult(Double.POSITIVE_INFINITY);
        }

        double z = 2.0;
        double correction = z*Math.sqrt(
                (exposedInlierCount / exposedOutlierCount)/totalExposedCount
                + (unexposedInlierCount / unexposedInlierCount)/totalMinusExposedCount
        );

        return new RiskRatioResult((exposedOutlierCount / totalOutliers) /
                                   (exposedInlierCount / totalInliers), correction);

I also found two tests that need to be modified since they check boundary condition behavior incorrectly.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

ganesh-srinivas commented Jun 13, 2018

Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

Bug in risk ratio formula and boundary condition in legacy and lib source code files #266

Comments

ganesh-srinivas commented Jun 13, 2018

tldr

Issue

Fix