tldr
The Risk Ratio formula and one boundary condition are implemented incorrectly. A fix is provided below. I can open a Pull Request if you want me to.
Issue
In Section 5.1 (Semantics: Support and Risk Ratio) of the MacroBase paper, risk ratio is defined as:
riskRatio = (a_o / (a_o + a_i)) / (b_o / (b_o + b_i))
When translated using MacroBase source code variables, the risk ratio expression should be:
riskRatio = (exposedOutlierCount / totalExposedCount) / (exposedInlierCount / totalInlierCount)
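To make the paper's definition concrete, here is a minimal Java sketch that evaluates it directly from the four counts. It assumes the usual 2x2 contingency reading of the symbols (a_o and a_i are the outliers and inliers that contain the attribute combination, b_o and b_i are the outliers and inliers that do not). The class and method names are hypothetical and are not part of MacroBase.

```java
// Hypothetical helper for illustration only; not MacroBase code.
final class PaperRiskRatioSketch {
    private PaperRiskRatioSketch() {}

    /**
     * Evaluates (a_o / (a_o + a_i)) / (b_o / (b_o + b_i)).
     * Assumed reading: a_* = records with the pattern, b_* = records without it.
     */
    static double riskRatio(long aO, long aI, long bO, long bI) {
        if (aO < 0 || aI < 0 || bO < 0 || bI < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        // Fraction of records with the pattern that are outliers.
        double exposedRisk = (double) aO / (aO + aI);
        // Fraction of records without the pattern that are outliers.
        double unexposedRisk = (double) bO / (bO + bI);
        // No special casing here: with IEEE 754 doubles, a positive value
        // divided by 0.0 is Infinity and 0.0 / 0.0 is NaN.
        return exposedRisk / unexposedRisk;
    }

    public static void main(String[] args) {
        System.out.println(riskRatio(30, 10, 20, 140));
    }
}
```

For example, with a_o = 30, a_i = 10, b_o = 20, b_i = 140, the sketch evaluates (30 / 40) / (20 / 160) = 0.75 / 0.125 = 6.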
The MacroBase GUI Tutorial, provided as part of the documentation, describes the risk ratio statistic as follows. Please note that the example in the tutorial shows an incorrect risk ratio value because of the bug I'm reporting.
Ratio Out/In is the proportion of outlier records containing this attribute combination compared to the proportion of inlier records containing this attribute combination (i.e., support in outliers divided by support in inliers). A ratio of 1 means that this pattern appeared equally frequently in inlier and outliers. A ratio of infinity means this pattern was not present in the inliers.
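The tutorial's description (support in outliers divided by support in inliers) can be sketched in Java as follows. This is only an illustration of the quoted text, not MacroBase code: the class name, the method names, and the choice to return NaN when the pattern appears in neither group are assumptions.

```java
// Hypothetical helper for illustration only; not MacroBase code.
final class RatioOutInSketch {
    private RatioOutInSketch() {}

    /** Fraction of a group's records that contain the pattern. */
    static double support(long withPattern, long total) {
        if (total <= 0 || withPattern < 0 || withPattern > total) {
            throw new IllegalArgumentException("need 0 <= withPattern <= total and total > 0");
        }
        return (double) withPattern / total;
    }

    /** Support in outliers divided by support in inliers, as the tutorial text describes. */
    static double ratioOutIn(long exposedOutliers, long totalOutliers,
                             long exposedInliers, long totalInliers) {
        double outlierSupport = support(exposedOutliers, totalOutliers);
        double inlierSupport = support(exposedInliers, totalInliers);
        if (exposedInliers == 0) {
            // Tutorial: infinity means the pattern was not present in the inliers.
            // Assumption: if it is absent from the outliers too, report NaN (undefined).
            return exposedOutliers == 0 ? Double.NaN : Double.POSITIVE_INFINITY;
        }
        return outlierSupport / inlierSupport;
    }

    public static void main(String[] args) {
        // Pattern in 10 of 100 outliers and 50 of 500 inliers: equally frequent.
        System.out.println(ratioOutIn(10, 100, 50, 500));
    }
}
```

With these counts both supports are 10%, so the ratio is 1, matching the tutorial's statement that a ratio of 1 means the pattern appears equally frequently in inliers and outliers.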
Instead of the above expression, the following is seen (lines 46-47).
In addition to this bug, a boundary condition is implemented incorrectly (lines 35-38). When all outliers have the pattern, the risk ratio need not be INFINITY. The risk ratio becomes INFINITY only when none of the inliers have the pattern, i.e., when exposedInlierCount == 0.
Fix
These bugs in the Risk Ratio computation can be fixed by applying the following modification to the legacy and lib source code files that compute Risk Ratio.
The line numbers cited in this issue refer to macrobase/legacy/src/main/java/macrobase/analysis/summary/itemset/RiskRatio.java, lines 35 to 49 at commit f41fc5c.
I also found two tests that need to be modified because they check the boundary condition behavior incorrectly.