Model Evaluation

The following process is used to score different models on Orcasound test sets.

We use the area under the precision-recall curve (AU-PRC) as the metric; details are described in Methodology.

Steps

  1. Get the test data with download_datasets.py from the orcaml repo by specifying the --only_test flag. (For details about the test sets, see the Orcasound data wiki.)
  2. Run inference with your model and create a submission file following the Submission Format.
  3. Run score.py to get the results, as well as detailed precision-recall curves.
  4. Optionally, score multiple submission files together to compare different models.

Submission Format

Your submission should contain a series of time-intervals, with associated confidence scores for each wav_filename in the test set.

The intervals can have any duration, but should be non-overlapping and together cover the entire wav file. Any part of a file that is not covered by an interval is assumed to have zero confidence.

Start times and durations do not need to be precise, because they are quantized to 1-second windows. NOTE: do not apply any thresholding yourself; thresholding is handled by the scorer.

Specifically, we need a tsv with these columns:

  • wav_filename - name of the wav file the interval belongs to
  • start_time_s - start of the interval, relative to the start of the audio file (in sec)
  • duration_s - duration of the interval (in sec)
  • confidence - confidence score, which is used to generate the AU-PRC metrics and curve

For an example, see submission/AudioSet-VGGish-R1to7.tsv.
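
As a rough illustration of the format described above, the following sketch writes a small submission using the four column names from the list. The file name `example.wav`, the confidence values, the presence of a header row, and the helper names `check_non_overlapping` and `write_submission` are assumptions made for this example; check the example submission file for the exact layout the scorer expects.

```python
import csv
import io

# Column names taken from the Submission Format list.
COLUMNS = ["wav_filename", "start_time_s", "duration_s", "confidence"]


def check_non_overlapping(rows):
    """Raise ValueError if two intervals of the same wav file overlap."""
    by_file = {}
    for row in rows:
        by_file.setdefault(row["wav_filename"], []).append(row)
    for name, file_rows in by_file.items():
        file_rows.sort(key=lambda r: r["start_time_s"])
        for prev, cur in zip(file_rows, file_rows[1:]):
            if prev["start_time_s"] + prev["duration_s"] > cur["start_time_s"]:
                raise ValueError(f"overlapping intervals in {name}")


def write_submission(rows, handle):
    """Write rows as tab-separated values (header row is an assumption)."""
    check_non_overlapping(rows)
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical predictions: raw confidences, no thresholding applied.
rows = [
    {"wav_filename": "example.wav", "start_time_s": 0.0, "duration_s": 2.5, "confidence": 0.91},
    {"wav_filename": "example.wav", "start_time_s": 2.5, "duration_s": 2.5, "confidence": 0.07},
]

buffer = io.StringIO()  # swap for open("my_model.tsv", "w", newline="") to write a file
write_submission(rows, buffer)
submission_text = buffer.getvalue()
```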

Methodology

We quantize intervals from both the ground truth and the submission file into 1 second time windows.

If any part of an interval falls within the window from N to N+1 seconds, window N is counted.
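
The quantization rule can be sketched as follows. This is a minimal illustration, not the code in score.py: the function name `quantize_interval` is hypothetical, and treating intervals as half-open (an interval ending exactly at N seconds does not touch window N) is an assumption.

```python
import math


def quantize_interval(start_time_s, duration_s):
    """Return every window index N whose span [N, N+1) overlaps the interval."""
    if duration_s <= 0:
        return []
    end_time_s = start_time_s + duration_s
    first = math.floor(start_time_s)
    # Assumption: an interval ending exactly on a boundary does not count
    # the window that starts at that boundary.
    last = math.ceil(end_time_s) - 1
    return list(range(first, last + 1))


windows_a = quantize_interval(1.5, 1.0)  # covers 1.5 s to 2.5 s
windows_b = quantize_interval(3.0, 2.0)  # covers 3.0 s to 5.0 s
```

Here `windows_a` is `[1, 2]`, because the interval touches both the 1:2 and 2:3 windows, and `windows_b` is `[3, 4]`.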

Each quantized 1-second window is then treated as an individual example when computing the AU-PRC, which is the evaluation metric.
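
To make the per-example view concrete, here is one common way to estimate the area under a precision-recall curve from per-second labels and confidences: the step-wise sum usually called average precision. Whether score.py uses this estimator or another (for example trapezoidal integration) is not stated here, so treat the function below, including its name and its tie handling, as an assumption.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for a ground-truth positive second, 0 otherwise.
    scores: submitted confidence for the same second.
    Tied scores are ranked in input order (a simplification).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Precision at the rank where recall increases.
            area += true_positives / rank
    return area / total_positives


# Hypothetical per-second examples for one sub-dataset.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```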

The AU-PRC is computed individually for each sub-dataset and a simple average is taken for the OVERALL score.
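
The averaging step can be checked against the Results table. The sketch below takes the three per-dataset values reported there for Baseline-AudioSet-VGGish_R1to7 and computes their unweighted mean; the variable names are illustrative only.

```python
from statistics import mean

# Per-dataset AU-PRC for Baseline-AudioSet-VGGish_R1to7, copied from the Results table.
per_dataset = {
    "podcast_test_round1": 0.949,
    "podcast_test_round2": 0.803,
    "podcast_test_round3": 0.09,
}

# Simple (unweighted) average: each sub-dataset counts equally,
# regardless of how many seconds of audio it contains.
overall = mean(per_dataset.values())
```

Rounded to three decimals, `overall` matches the 0.614 reported in the OVERALL row.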

Example

The following command scores the baseline and the current best model.

python score.py -testSetDir [DOWNLOAD_DATASETS] -submissionFiles "submission\Baseline-AudioSet-VGGish_R1to7.tsv,submission\FastAI-ResNet50_R1to7.tsv" -threshold (OPTIONAL)

Three files are written to the directory containing the submission files: results.md (a summary), au_pr_curves.png (plots), and metrics.tsv (details).

If the optional -threshold argument is provided, precision, recall, and F1 scores are also included in metrics.tsv. (Note: this is for development purposes only; the official metric is the threshold-independent AU-PRC.)
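
For reference, threshold-dependent metrics of this kind are typically computed as in the sketch below. It is not taken from score.py: the function name is hypothetical, and counting a confidence equal to the threshold as a positive prediction is an assumption.

```python
def precision_recall_f1(labels, scores, threshold):
    """Precision, recall and F1 for per-second examples at a fixed threshold."""
    # Assumption: a score equal to the threshold counts as a positive prediction.
    predictions = [score >= threshold for score in scores]
    tp = sum(1 for label, pred in zip(labels, predictions) if label and pred)
    fp = sum(1 for label, pred in zip(labels, predictions) if not label and pred)
    fn = sum(1 for label, pred in zip(labels, predictions) if label and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical per-second labels and confidences.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.1]
precision, recall, f1 = precision_recall_f1(example_labels, example_scores, 0.75)
```

With a threshold of 0.75, two seconds are predicted positive, one of them correctly, and one true positive is missed, so precision, recall, and F1 are all 0.5. Moving the threshold changes these numbers, which is why the official metric avoids it.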

Results

This is the current state of the art for models in the repo :)

| dataset | Baseline-AudioSet-VGGish_R1to7 | FastAI-ResNet50_R1to7 | Baseline-AudioSet-VGGish_R1to12 | FastAI-ResNet50_R1to12 |
| --- | --- | --- | --- | --- |
| OVERALL | 0.614 | 0.836 | 0.681 | 0.872 |
| podcast_test_round1 | 0.949 | 0.979 | 0.939 | 0.977 |
| podcast_test_round2 | 0.803 | 0.923 | 0.834 | 0.938 |
| podcast_test_round3 | 0.09 | 0.605 | 0.269 | 0.700 |

Precision-Recall-plots

NOTE: If you are deploying a new model: (1) generate your submission file and score it, comparing with the existing files; (2) update /submission and this README with your results; (3) upload your model (with a similar naming convention) to folder and update the links in the README.

NOTE: The FastAI-ResNet50_R1to12 model was trained with extra false-positive data from the live system, in addition to the Round 1-12 training dataset.