Twitter Anomaly Detector
AnomalyDetection is an R package developed by Twitter that detects anomalies in time-series data. The package implements their Seasonal Hybrid ESD algorithm, which extends the generalized ESD algorithm to allow for seasonality in the data, i.e. different periods of patterns in the data that represent macro-level changes rather than micro-level anomalies.
To evaluate AnomalyDetection, written in R, on NAB, written in Python, we have three options: port the R code into Python, use an interface from R to Python like rpy2, or use the R code for anomaly detection and the Python code for evaluating the results. We elected to go with the third option, following "Path 3" in this NAB figure. Thus the task reduced to converting the NAB data files into structures as expected by AnomalyDetection, and then converting the output of AnomalyDetection into the results format required by NAB.
We provide a gist of the R script we used to run the AnomalyDetection algorithms on NAB, which includes a few subtleties detailed below.
As specified in the NAB technical whitepaper, datasets in NAB are CSV files with a "timestamp" column and a "value" column. The values are floats or integers, and the timestamps are strings of the form YYYY-mm-dd HH:MM:SS.s (in Python notation). In R notation, the timestamps are of the form %Y-%m-%d %H:%M:%OS. R provides a read.csv function to load NAB data into a dataframe that AnomalyDetection can use. Converting the timestamps in the CSV file to the appropriate datatype in R requires a bit of subtlety. With the path to the CSV file stored in dataFilePath:
setClass("nabDate")
setAs("character", "nabDate", function(from) as.POSIXlt(from, format="%Y-%m-%d %H:%M:%OS"))
nab_data <- read.csv(dataFilePath, colClasses=c("nabDate", "numeric"))
Now nab_data can be passed into the AnomalyDetection functions.
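For readers working on the Python side of NAB, the same parsing step can be sketched with the standard library. This is a minimal sketch, not part of our R script: the helper names parse_nab_timestamp and load_nab_csv are hypothetical, the sample rows are made up, and it assumes the fractional-seconds part of a timestamp may be absent.

```python
import csv
import io
from datetime import datetime


def parse_nab_timestamp(text):
    """Parse 'YYYY-mm-dd HH:MM:SS' with an optional fractional-seconds part."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt)


def load_nab_csv(handle):
    """Return a list of (timestamp, value) tuples from a NAB-style CSV."""
    reader = csv.DictReader(handle)
    return [
        (parse_nab_timestamp(row["timestamp"]), float(row["value"]))
        for row in reader
    ]


# Made-up sample rows, not taken from any NAB data file.
sample = io.StringIO(
    "timestamp,value\n"
    "2014-04-01 00:00:00,12.5\n"
    "2014-04-01 00:05:00.500000,13\n"
)
records = load_nab_csv(sample)
```

As in the R version, integer values are read as floats so that the value column has a single numeric type.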
The Ts version of AnomalyDetection is intended to use the periodicity in time series data to supplement the underlying algorithms. However, we found the algorithm failed to detect the necessary periodicity parameters for a large subset of the NAB data files. Researching the errors revealed open issues in the AnomalyDetection source code*, where the recommended course of action is to defer to the Vec version. Therefore we do not include the Ts version in the NAB results.
*There are two error statements and corresponding AnomalyDetection issues: "Anom detection needs at least 2 periods worth of data" and "must supply period length for time series decomposition".
We tuned the parameters of AnomalyDetectionVec to yield the best NAB results possible (across all application profiles), and the AnomalyDetectionTs parameters in an attempt to run it effectively on most of the dataset.
The parameters of significant consequence to the results of AnomalyDetectionVec are period and max_anoms. The former defines the number of records in a single period (used in seasonal decomposition), and the latter caps the fraction of data points that the algorithm will label as anomalous. We tuned these parameters manually in search of the best final scores, finding that period=150 and max_anoms=0.0020 (that is, 0.2% of the records) maximize the scores for all three NAB application profiles (standard, reward low FP, reward low FN).
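To get a feel for what these two values mean for a single data file, the arithmetic can be sketched as follows. This is only an illustration: the file length of 4032 records is hypothetical, the helper names are ours, and we assume the cap on flagged records is the product of max_anoms and the record count truncated to an integer.

```python
import math


def max_flagged_points(num_records, max_anoms):
    """Upper bound on flagged records, assuming the cap is truncated to an integer."""
    return math.trunc(num_records * max_anoms)


def full_periods(num_records, period):
    """Number of complete seasonal periods in a series of the given length."""
    return num_records // period


# Hypothetical file length; period and max_anoms are the tuned values from the text.
n = 4032
cap = max_flagged_points(n, 0.0020)
periods = full_periods(n, 150)
```

Under these assumptions, a file of that length would have at most 8 records flagged and would span 26 complete periods.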
To prepare NAB for analyzing results from a new detector, we ran the following script:
python scripts/create_new_detector.py --detector twitterADVec
This script generates the necessary directories and creates an entry in the thresholds JSON.
NAB requires a CSV file with timestamp, value, anomaly_score, and label columns, so we want to add these columns to our nab_data data frame. Because AnomalyDetection identifies anomalies, rather than reporting an anomaly probability or a raw score for each record, we used a binary anomaly_score: the records flagged by AnomalyDetection as anomalous are represented by 1, and all others 0. The label column is also binary, indicating whether or not a record is within a true anomaly window. The true anomalies and their durations are recorded in a JSON file of the combined windows. We used the jsonlite R package for handling the JSON.
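The logic for the two binary columns can be sketched in Python as below. This is a sketch under stated assumptions rather than a port of our R functions: the function name add_scores_and_labels, the JSON key example/file.csv, and all sample values are hypothetical, and we assume the windows JSON maps a data file path to a list of [start, end] timestamp pairs with both endpoints inclusive.

```python
import json
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def add_scores_and_labels(rows, detected, windows):
    """Append binary anomaly_score and label fields to (timestamp, value) rows.

    rows     -- list of (datetime, float) tuples
    detected -- set of datetimes flagged by the detector
    windows  -- list of (start, end) datetime pairs, treated as inclusive
    """
    out = []
    for ts, value in rows:
        score = 1 if ts in detected else 0
        label = 1 if any(start <= ts <= end for start, end in windows) else 0
        out.append((ts, value, score, label))
    return out


# Hypothetical windows file content for one made-up data file.
windows_json = (
    '{"example/file.csv": [["2014-04-01 00:05:00", "2014-04-01 00:10:00"]]}'
)
windows = [
    (datetime.strptime(start, FMT), datetime.strptime(end, FMT))
    for start, end in json.loads(windows_json)["example/file.csv"]
]

# Made-up records and a single made-up detection.
rows = [
    (datetime(2014, 4, 1, 0, 0), 1.0),
    (datetime(2014, 4, 1, 0, 5), 9.0),
    (datetime(2014, 4, 1, 0, 10), 1.5),
    (datetime(2014, 4, 1, 0, 15), 1.2),
]
detected = {datetime(2014, 4, 1, 0, 5)}
result = add_scores_and_labels(rows, detected, windows)
```

Note that the label depends only on the windows, so a record can be labelled 1 without being flagged by the detector, as the third row here is.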
With all columns added to the dataframe, write.csv lets us write the results to a CSV file that can be passed into NAB. Note: Each CSV file must have the name of the detector followed by an underscore at the beginning of the filename, e.g. twitterADVec_cpu_utilization_asg_misconfiguration.csv.
This is implemented in addDetections() and addLabels() of our gist for running AnomalyDetection.
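The naming rule and the output layout can be illustrated with a short Python sketch. The helper names results_filename and write_results are hypothetical, as are the directory in the example path and the single data row; only the detector name, the column names, and the example filename come from the text above.

```python
import csv
import io
import os


def results_filename(detector, data_file_path):
    """Prefix the data file's base name with the detector name and an underscore."""
    return detector + "_" + os.path.basename(data_file_path)


def write_results(handle, rows):
    """Write rows under the four column headers that NAB expects."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["timestamp", "value", "anomaly_score", "label"])
    writer.writerows(rows)


# The directory "data/example" is a placeholder, not a real NAB path.
name = results_filename(
    "twitterADVec", "data/example/cpu_utilization_asg_misconfiguration.csv"
)

# An in-memory buffer stands in for the results file; the row is made up.
buffer = io.StringIO()
write_results(buffer, [("2014-04-01 00:00:00", 1.0, 0, 0)])
```

Here name matches the example filename given in the note above.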
The results CSV files were placed in NAB/results/twitterADVec/ in categorical subdirectories. Now we're ready to score the results, and in the top level of NAB we run:
python run.py -d twitterADVec --optimize --score --normalize
This runs the scoring and normalization step for the twitterADVec detector. The optimization step is optional because we can manually set the thresholds (for all application profiles) arbitrarily between 0 and 1. That is, because the anomaly_score entries are binary, we can use a threshold of 0.5 and skip optimization.
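The reason the threshold choice does not matter for binary scores can be checked with a few lines of Python. This is a sketch with made-up scores and a hypothetical helper, assuming a record counts as a detection when its anomaly_score is greater than or equal to the threshold.

```python
def detections(scores, threshold):
    """Indices flagged as anomalous, assuming a record is flagged when score >= threshold."""
    return [i for i, score in enumerate(scores) if score >= threshold]


# Made-up binary anomaly scores.
scores = [0, 1, 0, 0, 1]

# Every threshold strictly between 0 and 1 selects the same records.
results = {t: detections(scores, t) for t in (0.1, 0.5, 0.9)}
```

Under that assumption the only degenerate choice is a threshold of exactly 0, which would flag every record.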
The final scores will be printed to the screen and written to nab/results/final_results.json, and results CSV files for each application profile will be written to the twitterADVec directory. We obtained the following output for AnomalyDetectionVec with optimized parameters:
Final score for 'twitterADVec_reward_low_FP_rate_scores' = 33.61
Final score for 'twitterADVec_reward_low_FN_rate_scores' = 53.50
Final score for 'twitterADVec_standard_scores' = 47.06