Commit

Add isolation forest description. #42
hannah-rae committed Sep 15, 2021
1 parent 30518ad commit 3699146
Showing 2 changed files with 18 additions and 4 deletions.
publications/iaai21/dora_outlier_detection_iaai.tex (13 changes: 9 additions & 4 deletions)
@@ -285,7 +285,7 @@ \section{Methods}
is associated with a location defined by a geographic coordinate reference
system (e.g., latitude/longitude in degrees). Most satellite data are
distributed as rasters; common formats include GeoTIFF, NetCDF, and
-\todo{HDF}. A data loader for each data type locates the data by the path(s)
+HDF. A data loader for each data type locates the data by the path(s)
defined in the configuration file and loads samples into a dictionary of numpy
arrays indexed by the sample id. This \texttt{data\_dict} is then passed to each of the ranking algorithms specified in
the configuration file.
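For illustration, a minimal sketch of what such a loader could look like; rasterio, the load_geotiffs name, and the dictionary layout are assumptions for this example, not the project's actual API:

    # Hypothetical sketch of a per-format data loader; all names here are
    # illustrative, not DORA's actual code.
    from pathlib import Path
    import numpy as np
    import rasterio  # assumed dependency for reading GeoTIFF rasters

    def load_geotiffs(data_dir):
        """Load every GeoTIFF under data_dir into a dict keyed by sample id."""
        data_dict = {}
        for path in sorted(Path(data_dir).glob("*.tif")):
            with rasterio.open(path) as src:
                # src.read() returns a (bands, rows, cols) numpy array
                data_dict[path.stem] = src.read().astype(np.float32)
        return data_dict

Each ranking algorithm named in the configuration file would then receive the same data_dict.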
@@ -345,10 +345,15 @@ \section{Methods}
the background distribution.

\subparagraph{Sparsity.}
-\todo{Describe isolation forest: Umaa.}
+Sparsity-based methods score outliers based on how isolated or sparse they
+are in the feature space. The isolation forest is a common sparsity-based
+method that constructs many random binary trees from a
+dataset~\cite{liu2008isolation}. The outlier score for a sample is the
+average path length from the root to the sample's leaf across the trees.
+Shorter paths indicate outliers because fewer splits were required to
+isolate the sample.
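If an equation is wanted here, the normalized anomaly score defined in the cited \cite{liu2008isolation} could follow the paragraph; the average path length above corresponds to $E[h(x)]$:

    \begin{equation}
      s(x, n) = 2^{-E[h(x)]/c(n)}, \qquad
      c(n) = 2H(n-1) - \frac{2(n-1)}{n},
    \end{equation}
    % where h(x) is the path length of sample x in one tree, c(n) is the
    % average path length of an unsuccessful search in a binary search tree
    % of n points, and H(i) is the i-th harmonic number. Scores near 1 flag
    % outliers; scores well below 0.5 indicate normal samples.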

\subparagraph{Likelihood.}
-\todo{Describe PAE: Hannah or Bryce}
The negative sampling algorithm is implemented by converting the unsupervised
outlier ranking problem into a semi-supervised
problem~\citep{sipple:neg-sampling20}. Negative (anomalous)
Expand All @@ -357,7 +362,7 @@ \section{Methods}
negative and positive examples are then used to train a random forest
classifier. We use the posterior probabilities of the random forest classifier
as outlier scores, which means that the observations with higher posterior
-probabilities are more likely to be outliers.
+probabilities are more likely to be outliers. \todo{Describe PAE: Hannah or Bryce}
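A minimal sketch of this scheme, assuming (as one reading of \citep{sipple:neg-sampling20}) that negatives are drawn uniformly from a slightly expanded bounding box of the data; function and parameter names are illustrative:

    # Hypothetical sketch of negative sampling; not DORA's actual code.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def negative_sampling_scores(X, expansion=0.1, random_state=0):
        """Score outliers by training a classifier against synthetic negatives."""
        rng = np.random.default_rng(random_state)
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = hi - lo
        # Draw synthetic "anomalous" points uniformly over an expanded box.
        negatives = rng.uniform(lo - expansion * span,
                                hi + expansion * span,
                                size=X.shape)
        X_train = np.vstack([X, negatives])
        y_train = np.concatenate([np.zeros(len(X)), np.ones(len(negatives))])
        clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
        clf.fit(X_train, y_train)
        # Posterior probability of the synthetic-anomaly class is the score.
        return clf.predict_proba(X)[:, 1]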

\paragraph{Results organization.}
Each of the outlier ranking algorithms returns an array containing the sample
publications/iaai21/dora_references.bib (9 changes: 9 additions & 0 deletions)
@@ -1,3 +1,12 @@
+@inproceedings{liu2008isolation,
+title={Isolation forest},
+author={Liu, Fei Tony and Ting, Kai Ming and Zhou, Zhi-Hua},
+booktitle={2008 Eighth IEEE International Conference on Data Mining},
+pages={413--422},
+year={2008},
+organization={IEEE}
+}

@article{molero2013analysis,
title={Analysis and optimizations of global and local versions of the RX algorithm for anomaly detection in hyperspectral data},
author={Molero, Jos{\'e} Manuel and Garzon, Ester M and Garcia, Inmaculada and Plaza, Antonio},