
[ICLR 2024 Oral] Less is More: Fewer Interpretable Region via Submodular Subset Selection


RuoyuChen10/SMDL-Attribution


If you like our project, please give us a star ⭐ on GitHub for the latest updates.


📰 News & Update

  • [2024.08.22] Our approach now supports audio attribution on foundation models; we use ImageBind as an example! Welcome to try it according to the tutorial!
  • [2024.06.16] Our approach now supports interpretation of the medical multimodal model Quilt! Welcome to try it according to the tutorial!
  • [2024.06.04] Our approach now supports multi-GPU interpretation processing; please refer to the ./scripts folder!
  • [2024.06.04] Our approach now supports CLIP interpretation! Welcome to try it according to the tutorial!
  • [2024.04.22] Our approach now supports LanguageBind interpretation! Welcome to try it according to the tutorial!
  • [2024.04.11] Our approach now supports multi-modal models with ViT as backbone (ImageBind, Pytorch only)! Welcome to try it according to the tutorial!
  • [2024.01.17] The original code is available now! Welcome to try it according to the tutorial!
  • [2024.01.16] The paper has been accepted by ICLR 2024 and selected for oral presentation!

🛠️ Environment (Updating)

Our method supports both the Keras and PyTorch deep learning frameworks. You can install PyTorch first.

opencv-python
opencv-contrib-python
mtutils
tqdm
scipy
scikit-learn
scikit-image
matplotlib==3.7.1
seaborn
xplique>=1.0.3
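
A simple way to install these dependencies (assuming you save the list above to a requirements.txt file):

pip install -r requirements.txt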

Our original code is based on Keras, while the verification on ViT models depends entirely on PyTorch.

conda create -n smdl python=3.10
conda activate smdl
python3 -m pip install tensorflow[and-cuda]
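
PyTorch itself can be installed with a command such as the following (a general example; pick the build matching your CUDA version from pytorch.org):

python3 -m pip install torch torchvision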

pip install git+https://github.com/facebookresearch/segment-anything.git
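
After installing segment-anything, sub-region masks can be generated with its automatic mask generator. A minimal sketch follows; the checkpoint path ckpt/sam_vit_h_4b8939.pth is an assumption, so download the official SAM checkpoint first.

import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Assumption: the official ViT-H SAM checkpoint has been downloaded to ckpt/.
sam = sam_model_registry["vit_h"](checkpoint="ckpt/sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, each with a boolean "segmentation" mask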

🐳 Model Zoo

Note: Our method no longer supports TensorFlow/Keras; it now focuses on PyTorch.

Recognition Models (please download the models and place them in ckpt/keras_model):

| Datasets | Model |
| --- | --- |
| Celeb-A | keras-ArcFace-R100-Celeb-A.h5 |
| VGG-Face2 | keras-ArcFace-R100-VGGFace2.h5 |
| CUB-200-2011 | cub-resnet101.h5, cub-resnet101-new.h5, cub-efficientnetv2m.h5, cub-mobilenetv2.h5, cub-vgg19.h5 |

Uncertainty Estimation Models (please download the models and place them in ckpt/pytorch_model):

| Datasets | Model |
| --- | --- |
| Celeb-A | edl-101-10177.pth |
| VGG-Face2 | edl-101-8631.pth |
| CUB-200-2011 | cub-resnet101-edl.pth |

😮 Highlights

| Sub-Region Division Method | Org. Prediction Score | Highest Prediction Score | Insertion AUC Score |
| --- | --- | --- | --- |
| SLICO | 0.7262 | 0.9522 | 0.7604 |
| SEEDS | 0.7262 | 0.9918 | 0.8862 |
| Prior Saliency Map + Patch | 0.7262 | 0.9710 | 0.7236 |
| Segment Anything Model | 0.7262 | 0.9523 | 0.6803 |
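
The sub-region division methods in the table above can be reproduced with standard tools. For instance, SLICO superpixels are available in opencv-contrib-python (listed in the requirements); a minimal sketch, with illustrative parameter values that are not necessarily the paper's settings:

import cv2

image = cv2.imread("example.jpg")
# SLICO variant of SLIC; region_size=40 is an illustrative choice, not the paper's setting.
slic = cv2.ximgproc.createSuperpixelSLIC(image, algorithm=cv2.ximgproc.SLICO, region_size=40)
slic.iterate(10)                              # refine superpixel boundaries
labels = slic.getLabels()                     # H x W map of pixel -> superpixel id
num_regions = slic.getNumberOfSuperpixels()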

Audio classification (on multimodal foundation model ImageBind) attribution:

Medical multimodal model debugging:

🗝️ How to Run (Updating)

If you want to see how to apply this to your own model, please refer to the Jupyter notebooks in ./tutorial/ first.

Note: We first publish how to compute attribution for multimodal models and how to evaluate it.

For multi-GPU processing, please refer to the ./scripts folder, for example:

./scripts/clip_multigpu.sh

Then, you may get a saved intermediate result in the path submodular_results/imagenet-clip-vitl/slico-0.0-0.05-1.0-1.0.

Evaluate the Insertion and Deletion metrics:

python -m evals.eval_AUC_faithfulness --explanation-dir submodular_results/imagenet-clip-vitl/slico-0.0-0.05-1.0-1.0

You may get results like:

Insertion AUC Score: 0.7550
Deletion AUC Score: 0.0814
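
For reference, the Insertion metric measures how quickly the model's confidence recovers as the most important pixels are revealed. A rough sketch of the idea follows, assuming a model callable that returns class probabilities and a per-pixel attribution map; this is not the repository's implementation.

import numpy as np

def insertion_auc(model, image, attribution, target, steps=50):
    # Reveal pixels from a blank canvas in order of decreasing attribution,
    # record the target-class score at each step, and average the curve.
    h, w = attribution.shape
    order = np.argsort(attribution.ravel())[::-1]   # most important pixels first
    canvas = np.zeros_like(image)
    per_step = int(np.ceil(h * w / steps))
    scores = []
    for i in range(steps + 1):
        idx = order[:i * per_step]
        ys, xs = np.unravel_index(idx, (h, w))
        canvas[ys, xs] = image[ys, xs]              # insert the revealed pixels
        scores.append(model(canvas)[target])        # assumption: model returns class probabilities
    return np.trapz(scores, dx=1.0 / steps)         # area under the confidence curve

The Deletion metric is analogous: the most important pixels are removed from the original image instead, so a lower score indicates a better attribution.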

👍 Acknowledgement

Xplique: a Neural Networks Explainability Toolbox

Score-CAM: a third-party implementation with Keras.

Segment-Anything: a new AI model from Meta AI that can "cut out" any object, in any image, with a single click.

CLIP: a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task.

ImageBind: ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.

LanguageBind: LanguageBind is a language-centric multimodal pretraining approach, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics.

✏️ Citation

@inproceedings{chen2024less,
  title={Less is More: Fewer Interpretable Region via Submodular Subset Selection},
  author={Chen, Ruoyu and Zhang, Hua and Liang, Siyuan and Li, Jingzhi and Cao, Xiaochun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}