Project Overview
The goal of this article is to explain, at a high level, the step-by-step process that was followed to produce the end results of this project. This should aid anyone looking to replicate or modify the project for their own use.
At the highest level, the goal of this project was to expand the number of food items recognizable by a feeding arm. Prior to this project, the camera could recognize fewer than 10 objects, and with poor accuracy. By the end of the project, our new model was able to recognize more than 30 food items with higher accuracy. In the following sections, I will broadly detail the steps taken to achieve this result.
From the genesis of this project, it was decided that we would use a machine learning model to recognize these different types of food items. But before we could build a model to recognize food items, we first needed to know which food items we wanted to recognize. To find out, we surveyed a small cohort of patients on the types of food they were most interested in eating, giving us a list of ~35 food items to target for recognition.
With the food items known, we traveled to the grocery store and purchased one of each. Food items were either bite-sized to begin with (e.g. grapes) or were cut into bite-sized portions (e.g. avocado). To get training data for our model, we then ran a series of 45 experiments. In each, a random color plate was selected along with a random subset of the food items. The number of food items on the plate was always more than 5 and never more than 15. We then positioned the Kinova 6DOF arm above the plate and had it move through a training cycle (i.e. a list of predefined poses). The goal here was to create varied training data to force the model to learn to truly recognize food items and not overfit the training data. Each training cycle was around 2.5 minutes, and for each we collected color data, depth data, and robot joint states for forward kinematics. In total we ended up with around 2.5 hours of data sampled at 60 Hz across 2.5 TB of hard disk space.
Given that volume of data, we determined that manually labeling it was not feasible from either a time or a cost perspective. To elaborate, we estimated it would take months' worth of labeling through a service like AWS MTurk, at a cost exceeding $20,000. However, it was still a requirement that we label at least a significant portion of these frames. After some brainstorming, we came up with an algorithm to automatically label the data from just a single annotation.
To describe this algorithm, suppose we are given a 2D picture of a sample plate of food with segmentations of the food items provided. Given that we also have the depth data associated with that frame, we could construct a 3D visualization of the scene with the segmented points now existing in 3D space. Then given a transformation from one robot arm state to another, it is just a matter of multiplying by a transformation matrix to determine where these 3D points will now exist relative to the new position of the arm. We can then reproject our 3D view back to 2D in order to have our segmentations for the new frame.
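The propagation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in scripts/annotate.py: the function names, the pinhole intrinsics matrix `K`, and the transform `T_new_from_old` (the 4x4 matrix taking points from the old camera frame to the new one, which would presumably be derived from the arm's forward kinematics) are all assumptions made for the example.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Lift (u, v) pixels with metric depth to 3D points in the camera frame."""
    u, v = pixels[:, 0], pixels[:, 1]
    x = (u - K[0, 2]) * depths / K[0, 0]
    y = (v - K[1, 2]) * depths / K[1, 1]
    return np.stack([x, y, depths], axis=1)

def transform(points, T):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ T.T)[:, :3]

def project(points, K):
    """Project 3D camera-frame points back to (u, v) pixel coordinates."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def propagate_mask_pixels(pixels, depths, K, T_new_from_old):
    """Move segmented pixels from the annotated frame into a new frame."""
    points_old = backproject(pixels, depths, K)
    points_new = transform(points_old, T_new_from_old)
    return project(points_new, K)

# Hypothetical intrinsics and a single segmented pixel at the image center.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0]])
depths = np.array([0.5])

# Hypothetical motion: the point shifts 10 cm along x in the new camera frame.
T_new_from_old = np.eye(4)
T_new_from_old[0, 3] = 0.1

new_pixels = propagate_mask_pixels(pixels, depths, K, T_new_from_old)
```

Because every step depends on `K` and on the camera-to-arm transform, small calibration errors in either one shift every reprojected pixel.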
This logic is all implemented in scripts/annotate.py. For each of the above experiments we segment only the first frame using an external tool. The script then loads this segmentation and uses the robot's initial position to generate a 3D representation of the space. Then on each subsequent frame we propagate our segmentations and save the results into a training directory. This method reduces months of manual annotation to a few hours and works perfectly.
Well, it works perfectly in theory anyway. As we tested this algorithm, we started to notice a big problem. While in some frames our propagated segmentations were perfect, in others the masks were slightly off, typically "sliding" a little too far off the food item. This is incredibly problematic because, as we discovered, feeding this poor data into an ML model causes it to struggle to learn what is and is not a food item. In testing, the model then predicted segmentations around the food item that contained parts of the plate or parts of other food items. Given that the downstream task is skewering the food item with a fork, inaccurate predictions about the location of a food item are detrimental to the success of the task.
We tried a variety of methods to rectify this issue but they were all outright failures. If repeating this experiment in the future, it is imperative that the camera is properly calibrated prior to collecting the experiment data as small errors here propagate very far.
The solution finally came via Meta's new model segment-anything (the Segment Anything Model, or SAM). At the time of writing, this model is state-of-the-art at visually segmenting an unknown scene. The catch, of course, is that the model has no knowledge of the objects it is segmenting. That is to say, it can perfectly segment an object on the table, like a cup, but it doesn't know that this particular object is a cup.
Immediately out of the box this model worked incredibly well on our data. The food items were well-segmented and the bounds were tight. So the question at this point was: how can we leverage the approximate object segmentations we have from above and this new SAM to generate higher quality training data?
The algorithm we wrote works like this: for a given frame, we load our approximate segmentations and then use SAM to predict the segmentations for the scene. We filter out SAM segmentations that are too big or look too much like shadows. For each remaining SAM segmentation, we find the nearest approximate segmentation and, if the distance between the two isn't too large, assign its label to the SAM segmentation. Written another way, each SAM segmentation takes the label of the closest approximate segmentation, provided that one is close enough.
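The matching step can be sketched roughly as follows. This is an illustrative sketch only: the source does not say how "distance" or "too big" were measured, so centroid distance and pixel count are assumptions here, the shadow filter is omitted, and all names and thresholds are hypothetical.

```python
import math

def centroid(mask_pixels):
    """Mean (x, y) position of a mask given as a list of pixel coordinates."""
    xs = [p[0] for p in mask_pixels]
    ys = [p[1] for p in mask_pixels]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def assign_labels(sam_masks, approx_masks, max_area, max_distance):
    """Give each SAM mask the label of the nearest approximate mask.

    sam_masks: list of masks, each a list of (x, y) pixels (unlabeled).
    approx_masks: dict mapping label -> list of (x, y) pixels (propagated).
    Masks that are empty, too large, or too far from any approximate mask
    are dropped rather than labeled.
    """
    approx_centroids = {
        label: centroid(px) for label, px in approx_masks.items() if px
    }
    if not approx_centroids:
        return []

    labeled = []
    for mask in sam_masks:
        if not mask or len(mask) > max_area:
            continue  # skip empty masks and masks that are too big
        c = centroid(mask)
        label, dist = min(
            ((lbl, math.dist(c, ac)) for lbl, ac in approx_centroids.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            labeled.append((label, mask))
    return labeled

# Toy example: one mask near "grape", one far from everything, one too large.
sam_masks = [
    [(10, 10), (11, 10)],
    [(100, 100)],
    [(x, 0) for x in range(50)],
]
approx_masks = {"grape": [(12, 10)], "carrot": [(50, 50)]}
result = assign_labels(sam_masks, approx_masks, max_area=10, max_distance=5)
```

In this toy case only the first SAM mask is kept, labeled "grape". The tight SAM boundary replaces the drifting propagated mask, while the propagated mask contributes only the label.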
This worked incredibly well with the only downside being that the correction took a long time to run. Now with high-quality training data we could finally train a model.
Surprisingly, for a project with the primary aim of recognizing food, a disproportionately small amount of time was actually spent building the model. Given prior success, we opted to use the Mask R-CNN architecture in PyTorch with a modified output layer to correspond to our 35 total predictable classes. Training with CUDA took a few hours, with the largest bottleneck actually being loading examples into memory.
To evaluate the model, we used mean average precision (mAP) and mean average recall (mAR). Our final, deployed model had values of 0.572 and 0.600 respectively, which are quite high for a multiclass recognition task such as this one.
The above numbers are useful, but on their own they don't give a good understanding of the performance of the model. Below is a video of the model running on the ROS node (described below). In terms of object classification, the results are good but variable. Some objects, like pepperoni, are far from the others in feature space and are identified perfectly every time. Others, like pieces of fruit, can sometimes suffer because similar objects exist in the dataset. However, the more important pieces are the bounding boxes, which tightly hug the food items exactly as we want. This demonstrates that the model is an effective tool for the downstream task of skewering.
VIDEO??
Lastly, with a working model, the code had to be wrapped in an actual ROS node that reads in images and publishes predictions. This logic is all written in the food_detector module, which orchestrates everything related to this process.