Grounding DINO + SAM (ref: https://huggingface.co/docs/transformers/model_doc/grounding-dino)
Grounding DINO (GDINO) is an object detection model that extends DINO (DETR with Improved deNoising anchOr boxes), a closed-set object detector, with a text encoder. This enables open-set object detection: the model can detect arbitrary objects specified by human inputs such as category names or referring expressions. The key innovation in GDINO is the tight fusion of language and vision modalities, accomplished through three main components:
- Feature Enhancer: Enhances the input features using both self-attention and cross-attention mechanisms.
- Language-Guided Query Selection: Selects queries during the detection process using language information.
- Cross-Modality Decoder: Uses cross-attention layers to improve query representations, ensuring better fusion of textual and visual information.
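The language-guided query selection step can be illustrated with a small, dependency-free sketch. It assumes, following the description in the Grounding DINO paper, that each image token is scored by its highest dot-product similarity to any text token, and that the top-scoring image tokens are used to initialize the decoder queries. The function name and the plain-list feature format are illustrative only and are not part of any library.

```python
from typing import List, Sequence


def select_query_indices(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_query: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Hypothetical helper: a plain-Python stand-in for the tensor operations
    a real implementation would use.
    """
    scores = []
    for img_vec in image_feats:
        # Similarity of this image token to every text token; keep the best match.
        best = max(
            sum(a * b for a, b in zip(img_vec, txt_vec)) for txt_vec in text_feats
        )
        scores.append(best)
    # Rank image tokens by their best text similarity, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_query]


# Four toy image tokens and two toy text tokens, all two-dimensional.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
top_indices = select_query_indices(image_feats, text_feats, num_query=2)
```

In the real model the features are high-dimensional tensors and the selection runs batched on the accelerator, but the ranking logic is the same idea.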
GDINO achieves remarkable results on various benchmarks, including COCO, LVIS, ODinW, and RefCOCO/+/g, demonstrating its ability to generalize to novel object categories and perform well in zero-shot scenarios (e.g., achieving 52.5 AP on COCO zero-shot transfer without any COCO training data).
Grounded SAM is an approach that combines Grounding DINO with the Segment Anything Model (SAM) for text-based mask generation. SAM is a promptable segmentation model: given an image and a prompt such as points or bounding boxes, it returns masks for the indicated objects, and its name ("segment anything") reflects this broad applicability. By integrating GDINO with SAM, Grounded SAM leverages the strengths of both models:
- GDINO provides the ability to detect objects based on text descriptions.
- SAM performs precise segmentation of these detected objects.
This combination allows users to generate masks for objects specified through natural language, making it a powerful tool for diverse visual tasks where precise object localization and segmentation based on textual descriptions are required.
Here's an example of how to use GDINO for zero-shot object detection:
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Text queries must be lowercase, each ending with a period.
text = "a cat. a remote control."

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes, scores, and labels in original image coordinates.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],  # PIL size is (width, height); (height, width) is expected
)
This script sets up the processor and model, processes an image and text pair, and performs object detection, post-processing the outputs to get the final detection results.
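Two details of the script above are easy to get wrong: the prompt format and the structure of the returned results. The sketch below shows one way to handle both. The helper names are hypothetical. The result layout (one dict per image with "scores", "labels", and "boxes" entries) is assumed from the Transformers documentation example and may differ between library versions.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_prompt(labels: Iterable[str]) -> str:
    """Join labels into one lowercase prompt in which every label ends with a period."""
    parts = []
    for label in labels:
        label = label.strip().lower()
        if not label:
            continue  # skip empty labels
        if not label.endswith("."):
            label += "."
        parts.append(label)
    return " ".join(parts)


def summarize_detections(result: Dict[str, Any]) -> List[Tuple[str, float, List[float]]]:
    """Turn one per-image result dict into (label, score, [xmin, ymin, xmax, ymax]) rows.

    Accepts plain lists as well as tensors, since only iteration and float() are used.
    """
    rows = []
    for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
        rows.append((label, round(float(score), 3), [round(float(v), 1) for v in box]))
    return rows


prompt = build_prompt(["a cat", "A Remote Control."])

# A stand-in for results[0], using plain lists instead of tensors.
example_result = {
    "scores": [0.5],
    "labels": ["a cat"],
    "boxes": [[1.0, 2.0, 3.0, 4.0]],
}
rows = summarize_detections(example_result)
```

With the script above, `summarize_detections(results[0])` would list the detections for the single input image.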
According to this notebook, you can combine Grounding DINO with the Segment Anything (SAM) model for text-based mask generation by following these steps:
- Use Grounding DINO to detect the given set of text labels in the image and obtain bounding boxes for the detected objects.
- Provide the image and the bounding boxes from Grounding DINO to the Segment Anything (SAM) model.
- The SAM model will generate segmentation masks corresponding to the provided bounding boxes.
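The second and third steps can be sketched with the SAM classes from Transformers. This is a minimal sketch, not the notebook's code: the function names are hypothetical, the checkpoint "facebook/sam-vit-base" is an assumed choice, and the boxes are assumed to be in [xmin, ymin, xmax, ymax] pixel coordinates as returned by Grounding DINO post-processing.

```python
from typing import List, Sequence


def boxes_to_sam_input(boxes: Sequence[Sequence[float]]) -> List[List[List[float]]]:
    """Wrap per-object boxes in the extra batch dimension used for a single image."""
    return [[[float(v) for v in box] for box in boxes]]


def segment_boxes(image, boxes, segmenter_id="facebook/sam-vit-base", device="cpu"):
    """Return SAM masks for each box in one image (hypothetical helper).

    `boxes` must be non-empty; filter out images without detections beforehand.
    """
    # Imported lazily so boxes_to_sam_input stays usable without these packages.
    import torch
    from transformers import SamModel, SamProcessor

    processor = SamProcessor.from_pretrained(segmenter_id)
    model = SamModel.from_pretrained(segmenter_id).to(device)

    inputs = processor(
        images=image, input_boxes=boxes_to_sam_input(boxes), return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # Request a single mask per box instead of several candidates.
        outputs = model(**inputs, multimask_output=False)

    # Resize the low-resolution predictions back to the original image size.
    masks = processor.post_process_masks(
        outputs.pred_masks.cpu(),
        inputs["original_sizes"].cpu(),
        inputs["reshaped_input_sizes"].cpu(),
    )
    return masks[0]  # masks for the single input image


sam_boxes = boxes_to_sam_input([(10, 20, 110, 220)])
```

Loading the model inside the function keeps the sketch short; in practice the processor and model would be loaded once and reused across images.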
The key functions defined in the notebook are:
- detect(image, labels, threshold, detector_id): This function uses Grounding DINO to detect the given labels in the image and returns a list of DetectionResult objects containing the bounding boxes and scores.
- segment(image, detection_results, polygon_refinement, segmenter_id): This function takes the image and the detection_results from Grounding DINO, and uses the Segment Anything (SAM) model to generate segmentation masks for the bounding boxes. It returns the DetectionResult objects with the masks added.
- grounded_segmentation(image, labels, threshold, polygon_refinement, detector_id, segmenter_id): This is the main function that combines the above two functions. It first calls detect to get the bounding boxes from Grounding DINO, and then calls segment to generate the masks using SAM. It returns the original image and the list of DetectionResult objects with masks.
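The data flow between these functions can be sketched without loading any model. The field layout of DetectionResult and BoundingBox below is an assumption for illustration, and the notebook's actual definitions may differ. The composition function takes the detector and segmenter as callables (a simplification of the detector_id and segmenter_id parameters above) so that the flow can be exercised with stubs.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class BoundingBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def xyxy(self) -> List[float]:
        """Box as [xmin, ymin, xmax, ymax], the format passed on to the segmenter."""
        return [self.xmin, self.ymin, self.xmax, self.ymax]


@dataclass
class DetectionResult:
    score: float
    label: str
    box: BoundingBox
    mask: Optional[Any] = None  # filled in by the segmentation step


def grounded_segmentation(
    image: Any,
    labels: List[str],
    detect: Callable[[Any, List[str], float], List[DetectionResult]],
    segment: Callable[[Any, List[DetectionResult]], List[DetectionResult]],
    threshold: float = 0.3,
) -> Tuple[Any, List[DetectionResult]]:
    """Run detection, then segmentation, and return the image with the enriched results."""
    detections = detect(image, labels, threshold)
    detections = segment(image, detections)
    return image, detections


# Stub detector and segmenter standing in for Grounding DINO and SAM.
def fake_detect(image, labels, threshold):
    return [DetectionResult(score=0.9, label=labels[0], box=BoundingBox(0, 0, 10, 10))]


def fake_segment(image, detections):
    for det in detections:
        det.mask = "mask-for-" + det.label  # placeholder for a real mask array
    return detections


image, detections = grounded_segmentation("img", ["a cat."], fake_detect, fake_segment)
```

Replacing the stubs with functions that call Grounding DINO and SAM gives the full pipeline while keeping the same return shape.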