Trimming outliers using trees: Winning solution of the Large-scale Energy Anomaly Detection (LEAD) competition
In this repository, you can find the script notebooks of the winning solution (in the `notebooks` folder), the presentation slides, and the paper detailing the modeling framework.
https://www.kaggle.com/competitions/energy-anomaly-detection
- Data preprocessing
  - No anomalies were removed, because the goal of this contest is anomaly detection
  - Missing values (NaN) were replaced with the median value of each time series
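The per-series median imputation can be sketched with pandas. This is a minimal illustration, not the repository's actual code; the column names `building_id` and `meter_reading` are assumptions based on the competition's data layout:

```python
import numpy as np
import pandas as pd

# Toy long-format table: two buildings, each with a missing reading.
df = pd.DataFrame({
    "building_id": [1, 1, 1, 2, 2, 2],
    "meter_reading": [10.0, np.nan, 30.0, 5.0, 5.0, np.nan],
})

# Replace NaNs with the median of each building's own time series.
df["meter_reading"] = df.groupby("building_id")["meter_reading"].transform(
    lambda s: s.fillna(s.median())
)
```

Using the median of each series (rather than a global statistic) keeps the fill value on the scale of that particular building's consumption.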
- Feature engineering
  - Building metadata and weather data
  - Temporal features (e.g., hour, weekday, and day of year)
  - Target encoding features (ref: preprocessing script from the 1st place team in GEPIII)
  - Value-change features: the change of a value relative to nearby values (e.g., X(t)-X(t-1) and X(t)/X(t-1)) with varying shift steps (from 1 hour to 168 hours)
  - Features from data smoothing and k-means clustering were also tried, but they did not significantly improve the score
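The temporal and value-change features above can be sketched as follows. This is a simplified, single-building example under assumed column names; the actual notebooks may compute these per building and use more shift steps than the three shown:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series for one building (8 days of data).
idx = pd.date_range("2016-01-01", periods=24 * 8, freq="h")
df = pd.DataFrame({
    "timestamp": idx,
    "meter_reading": np.arange(1, len(idx) + 1, dtype=float),
})

# Temporal features: hour, weekday, and day of year.
df["hour"] = df["timestamp"].dt.hour
df["weekday"] = df["timestamp"].dt.weekday
df["dayofyear"] = df["timestamp"].dt.dayofyear

# Value-change features X(t)-X(t-k) and X(t)/X(t-k) for a few shift
# steps k, from 1 hour up to 168 hours (one week).
for step in [1, 24, 168]:
    shifted = df["meter_reading"].shift(step)
    df[f"diff_{step}"] = df["meter_reading"] - shifted
    df[f"ratio_{step}"] = df["meter_reading"] / shifted
```

The 24-hour and 168-hour steps compare each reading against the same hour on the previous day and the previous week, which is a natural baseline for periodic energy consumption.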
- Modeling
  - Train/valid split by building_id to ensure the validation data were unseen during training
  - Downsampling of the training dataset to address class imbalance (anomalies are ~5% of rows)
  - Model ensembling via simple averaging: XGBoost, LightGBM, CatBoost, and HistGradientBoosting (weight of 0.25 for each)
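The downsampling and equal-weight averaging steps can be sketched with NumPy. Random stand-in arrays replace the four boosted models' predicted probabilities here, since the point is the index bookkeeping and the averaging, not the model training itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled training rows: anomalies are roughly 5% of rows.
n = 10_000
y = (rng.random(n) < 0.05).astype(int)

# Downsample the majority (normal) class so the training set is balanced:
# keep every anomaly and an equal-sized random subset of normal rows.
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
train_idx = np.concatenate([pos_idx, neg_keep])

# Simple-averaging ensemble: stand-ins for the four models' predicted
# probabilities (XGBoost, LightGBM, CatBoost, HistGradientBoosting),
# combined with a weight of 0.25 each.
preds = np.stack([rng.random(n) for _ in range(4)])
ensemble = np.average(preds, axis=0, weights=[0.25] * 4)
```

With equal weights, `np.average` reduces to the plain mean of the four prediction vectors.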
- Postprocessing
  - Predictions were set to zero (normal) for rows with a meter_reading of 1.0
  - Predictions were set to zero at the start and end points of each time series
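Both postprocessing rules can be sketched in a few lines of pandas. The frame below and its column names (`building_id`, `meter_reading`, `anomaly` for the predicted label) are illustrative assumptions:

```python
import pandas as pd

# Hypothetical prediction table for two buildings, in timestamp order.
df = pd.DataFrame({
    "building_id": [1, 1, 1, 1, 2, 2, 2],
    "meter_reading": [1.0, 50.0, 60.0, 55.0, 3.0, 1.0, 4.0],
    "anomaly": [1, 0, 1, 1, 1, 1, 1],
})

# Rule 1: force predictions to zero where meter_reading is exactly 1.0.
df.loc[df["meter_reading"] == 1.0, "anomaly"] = 0

# Rule 2: force predictions to zero at the first and last row of each
# building's time series.
grp = df.groupby("building_id")
first_last = pd.concat([grp.head(1), grp.tail(1)]).index
df.loc[first_last, "anomaly"] = 0
```

Rules like these encode the observation that certain positions are unreliable for the models (e.g., series boundaries, where the lag-based value-change features are undefined), so overriding them is safer than trusting the raw model output there.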