Semantic Labeling #203
base: ros2-devel
Conversation
Looking very good overall. Could likely use some cleanup, but the core functionality seems to make sense. I would also check for `pylint` warnings if you haven't, since I noticed many in the old code that this PR was based off of.
# Environment Variables file
.env
I don't see a `.env` file added in this PR, but I'm guessing this was more for personal use. I'd recommend omitting this change unless it's relevant to the functionality of the PR.
I noticed more references to a `.env` file later in the code; where exactly does this come into play?
I'm adding environment-variable functionality to our codebase so we can privately store API keys without exposing them publicly on GitHub. In this particular case, it is for accessing the PRL OpenAI API key to invoke GPT-4o.
I'm assuming this may come in handy later on as well if we power perception w/ foundation models in the future.
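For context, a minimal sketch of how such a `.env` file is typically consumed with `python-dotenv` (the `OPENAI_API_KEY` variable name is an assumption, not necessarily what this PR uses):

```python
# Hypothetical usage sketch of python-dotenv; the variable name
# OPENAI_API_KEY is an assumption, not necessarily what this PR uses.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into os.environ
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY not set; add it to your .env file")
```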
# # If you need to send a fixed food frame to the robot arm, e.g., to
# # If you need to send a fixed food frame to the robot arm, e.g., to
Unintended?
# The list of input semantic labels for the food items on the plate
string caption
The comment seems misleading; I suspect this was an old comment for `item_labels`.
@@ -2,5 +2,8 @@ pyrealsense2
overrides
sounddevice
scikit-spatial
openai
python-dotenv
Related to another comment on `.gitignore`: are we using `.env` files for something?
- /camera/color/image_raw/compressed
- /camera/color/camera_info
- /camera/aligned_depth_to_color/image_raw/compressedDepth
- /camera/aligned_depth_to_color/camera_info
- /local/camera/aligned_depth_to_color/image_raw/compressedDepth
What's the difference between the topic starting with `/local` and the ones you have added? Does republishing several streams of image data significantly increase the load on our network?
Hmm, it seems like the base branch of this PR has not been updated, because all these extra topics were removed in a push a few months ago: https://github.com/personalrobotics/ada_feeding/blob/ros2-devel/ada_feeding_perception/config/republisher.yaml. @sriramk117, if you pull the updated main and then rebase or merge on top of it, that should address this (non-)change.
boxes_xyxy[phrase].append([x0, y0, x1, y1])

# Measure the elapsed time running GroundingDINO on the image prompt
inference_time = int(round((time.time() - inference_time) * 1000))
Is this meant to be logged?
Also, why are you multiplying by 1000? In general, I find it more readable to specify the units of measurement after the variable name itself; e.g., in this case maybe it should be `inference_time_ms`, given the factor of 1000?
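For instance, a quick sketch of that renaming (illustrative only, and avoiding reusing one variable for both the start time and the elapsed time):

```python
# Suggested renaming sketch: label the units in the variable name.
import time

start_time = time.time()
# ... run GroundingDINO inference here ...
inference_time_ms = int(round((time.time() - start_time) * 1000))
```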
center_x = (bbox[0] + bbox[2]) // 2
center_y = (bbox[1] + bbox[3]) // 2
For this and other instances of `bbox`, does it have any properties like `xmin`, `xmax`, `ymin`, `ymax` to make this easily understandable?
You could consider using a namedtuple for this.
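For example, a minimal sketch (the field names are assumptions about what would be clearest):

```python
from typing import NamedTuple

class BBox(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates."""
    xmin: int
    ymin: int
    xmax: int
    ymax: int

bbox = BBox(10, 20, 110, 220)
center_x = (bbox.xmin + bbox.xmax) // 2  # clearer than bbox[0] + bbox[2]
center_y = (bbox.ymin + bbox.ymax) // 2
```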
mask_msg.average_depth = median_depth_mm / 1000.0
mask_msg.item_id = item_id
mask_msg.object_id = object_id
mask_msg.confidence = float(score)
The old code for `SegmentFromPoint` didn't require a type-cast to `float`. It's fine if this is needed; just wanted to check, since I noticed the discrepancy.
Also, these small changes are precisely why it's best not to copy-paste code (e.g., if casting to `float` improves the code here, we should ideally percolate it back to `SegmentFromPoint`). So this is one more of the functions that should be in a helper. Another thing you could consider is making `SegmentAllItems` inherit from the `SegmentFromPoint` class and only override some functions.
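A rough sketch of that inheritance idea (everything except the two class names is a placeholder, not this repo's actual API):

```python
class SegmentFromPoint:  # stand-in for the existing action server class
    def create_mask_msg(self, mask, score):
        """Shared helper: build a Mask message (the float cast lives here)."""
        ...

class SegmentAllItems(SegmentFromPoint):
    """Reuses SegmentFromPoint's shared helpers and overrides only the
    parts specific to segmenting every detected item at once."""
    def segment(self, image):
        # Override: segment every detected item instead of a single point.
        ...
```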
result: The result message containing masks for all food items detected in the image
    paired with semantic labels.
"""
self._node.get_logger().info("Received a new goal!")
Should mention the `goal_handle` here.
self._node.get_logger().info("Goal not cancelled.") | ||
self._node.get_logger().info("VIsion pipeline completed successfully.") |
This is also known as "Goal succeeded"; also, there's a typo in "VIsion".
---
# A sentence caption compiling the semantic labels used as a query for
# GroundingDINO to perform bounding box detections.
string caption
Nit: add newlines at the end of files. (I know not all files have it, but in general it is a best practice so we should enforce it on new/modified files)
# A boolean to determine whether to visualize the bounding box predictions
# made by GroundingDINO
viz_groundingdino: false

Nit: newline. I think your empty line at the end of the file has some whitespace, which is why GitHub doesn't recognize it as the empty line at the end of the file.
Nit: does this belong in `model` or `config`? This seems more like configuration; I believe what is downloaded to `model` is the actual files storing model weights.
# Third-party imports
import cv2
import time
import random
Also, some of these, like `time` and `random`, are "standard imports". `pylint` should help take care of some of these issues. Eventually, I want to set up `pre-commit` on this repo so reformatting and some level of linting happen automatically.
if self.viz_groundingdino:
    self.visualize_groundingdino_results(image, bbox_predictions)

# Collect the top contender mask for each food item label detected by
I'd suggest having a parameter to the function (set by default to `1`) that controls how many of the top masks you use per food item.
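For example, a hypothetical helper along these lines (all names are made up; `masks_per_label` maps each food-item label to a list of (mask, confidence) pairs):

```python
def pick_top_masks(masks_per_label, n_masks_per_item: int = 1):
    """Keep the n highest-confidence masks for each detected label."""
    return {
        label: sorted(pairs, key=lambda p: p[1], reverse=True)[:n_masks_per_item]
        for label, pairs in masks_per_label.items()
    }
```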
result.item_labels = item_labels

# Measure the elapsed time running GroundingDINO on the image prompt
inference_time = int(round((time.time() - inference_time) * 1000))
Same comment as above re: ms vs. sec and adding units to variable names.
caption = goal_handle.request.caption

# Create a rate object to control the rate of the vision pipeline
rate = self._node.create_rate(self.rate_hz)
I recently learned that rates live forever in a node unless explicitly deleted by `destroy_rate` (see comment here). So I'd recommend adding that at the end of this function, else the rate will keep taking up callback resources even after the action has finished.
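Something like the following shape (`create_rate`/`destroy_rate` are real rclpy `Node` APIs; the surrounding names mirror the diff above):

```python
# Suggested cleanup sketch for this function.
rate = self._node.create_rate(self.rate_hz)
try:
    while (
        rclpy.ok()
        and not goal_handle.is_cancel_requested
        and not vision_pipeline_task.done()
    ):
        rate.sleep()
finally:
    # Without this, the rate's underlying timer keeps consuming callback
    # resources for the lifetime of the node.
    self._node.destroy_rate(rate)
```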
while (
    rclpy.ok()
    and not goal_handle.is_cancel_requested
    and not vision_pipeline_task.done()
Are there cases where the vision pipeline could hang? If so, I'd recommend adding a timeout to the action (maybe in the action message itself) to be robust to that
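For illustration, one way the timeout could look (the value and where it comes from are assumptions; it could instead be a new field in the action goal):

```python
# Illustrative timeout sketch around the existing loop; variable names
# besides those in the diff are made up.
from rclpy.duration import Duration

timeout = Duration(seconds=30.0)  # hypothetical value
start_time = self._node.get_clock().now()
while (
    rclpy.ok()
    and not goal_handle.is_cancel_requested
    and not vision_pipeline_task.done()
):
    if self._node.get_clock().now() - start_time > timeout:
        # Give up on a hung pipeline and report a timed-out result.
        vision_pipeline_task.cancel()
        break
    rate.sleep()
```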
result.status = result.STATUS_CANCELLED

# Clear the active goal
with self.active_goal_request_lock:
Why not do this in the cleanup function itself? (both here and below)
Description
Overview
This PR implements an ADA feeding perception node that enables semantic labeling. Semantic labeling acts as infrastructure in the perception code for advances to the user interface, including: (1) allowing natural-language prompting for ADA users, (2) auto-feeding features that let the robot keep feeding a specific food item on the plate until it is no longer there, without user prompting, and (3) taking user preference (i.e., how they want to eat their meal) into account during bite selection.
ROS2 Interfaces
First, the node exposes a service (`~/invoke_gpt4o`) that takes in a list of string labels describing the food items on the plate (for example, ['meat', 'cucumber']) and then runs GPT-4o to compile these string labels into a visually-descriptive sentence query for GroundingDINO (for example, "Food items including strips of grilled meat and seasoned cucumber spears arranged on a light gray plate."). The node also creates an action server (SegmentAllItems) that runs a pipeline consisting of a vision language model called GroundingDINO and the segmentation model EfficientSAM/SAM to output masks, paired with semantic labels describing the masks, for each of the detected food items on the plate.
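For reference, a minimal `rclpy` client sketch for the caption service; the service name and field names are taken from the testing commands below, and the rest is illustrative:

```python
import rclpy
from rclpy.node import Node
from ada_feeding_msgs.srv import GenerateCaption

rclpy.init()
node = Node("caption_client")
client = node.create_client(GenerateCaption, "/segment_all_items/invoke_gpt4o")
client.wait_for_service()
request = GenerateCaption.Request(input_labels=["meat", "cucumber"])
future = client.call_async(request)
rclpy.spin_until_future_complete(node, future)
print(future.result().caption)  # visually-descriptive GroundingDINO query
```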
Implementation Details
The perception node uses three foundation models to achieve semantic labeling: GPT-4o, GroundingDINO, and EfficientSAM/SegmentAnything.
From experimentation with prompting techniques on GroundingDINO and GroundingDINO's feature-extraction functionality, we determined that prompting GroundingDINO with a sentence that contains descriptive qualifiers results in significantly better bounding box detection. Therefore, we query GPT-4o with the latest above-plate image and the user's list of string labels for the food items on the plate, and ask it to generate a visually-descriptive sentence that incorporates those labels and describes the image. This sentence is then used as the input prompt for GroundingDINO.
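As an illustration of this step, a hedged sketch of such a GPT-4o call with the `openai` Python client (the prompt wording, file name, and labels are placeholders, not the PR's exact implementation):

```python
# Illustrative sketch of the GPT-4o captioning step.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment (via .env)
image_b64 = base64.b64encode(open("above_plate.jpg", "rb").read()).decode()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe these food items in one visually-descriptive "
                     "sentence: meat, cucumber."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
caption = response.choices[0].message.content
```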
GroundingDINO is a vision language model that detects objects in an image (through bounding boxes) given natural language prompts. We use GroundingDINO as it pairs semantic labels from the natural language prompt with each of the detected bounding boxes.
Then, we use the segmentation model EfficientSAM/SAM to extract pixel-wise masks of each of the detected objects given their bounding boxes.
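For illustration, a minimal sketch of this box-prompted segmentation step using the `segment_anything` API (the checkpoint path and model type are assumptions; `image_rgb` and the box coordinates come from the earlier pipeline stages):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)  # HxWx3 uint8 RGB frame from the camera topic
masks, scores, _ = predictor.predict(
    box=np.array([x0, y0, x1, y1]),  # one GroundingDINO bounding box (XYXY)
    multimask_output=False,  # single best mask per box
)
```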
Here, we share a couple of images of GroundingDINO's bounding box detection performance on various above-plate photos:
Testing procedure
Test GroundingDINO + SegmentAnything vision pipeline:
We can test the GroundingDINO + SegmentAnything vision pipeline by running the web app with dummy nodes.
Follow these steps:
cd src/feeding_web_interface/feedingwebapp
npm run start
node --env-file=.env server.js
node start_robot_browser.js
ros2 launch ada_feeding_perception ada_feeding_perception.launch.py
ros2 launch feeding_web_app_ros2_test feeding_web_app_dummy_nodes_launch.xml run_food_detection:=false run_face_detection:=false
ros2 action send_goal /SegmentAllItems ada_feeding_msgs/action/SegmentAllItems "caption: Some input sentence query for the pipeline that visually describes the food items on the plate."
ros2 service call /segment_all_items/invoke_gpt4o ada_feeding_msgs/srv/GenerateCaption "input_labels: ['grape', 'cantaloupe', 'carrot', 'blueberry', 'strawberry']"
Test GPT-4o service:
ros2 launch ada_feeding_perception ada_feeding_perception.launch.py
ros2 service call /segment_all_items/invoke_gpt4o ada_feeding_msgs/srv/GenerateCaption "input_labels: ['grape', 'cantaloupe', 'carrot', 'blueberry', 'strawberry']"
Before opening a pull request
- Format your code by running: python3 -m black .
- Within the ada_feeding directory, run: pylint --recursive=y --rcfile=.pylintrc .
Before Merging
Squash & Merge