At OSRF, we are in the process of defining a new standard set of ROS messages for the computer vision community, and we'd like your help. This need was identified from our computer vision [survey](https://discourse.ros.org/t/survey-computer-vision-in-ros-gazebo/1607/2) as a first step towards improving the ROS computer vision ecosystem, so thank you for the feedback!
The end result of this effort may be a new message package in `common_msgs`, a REP, or both. Our goal is to capture as many common computer vision use cases as possible, with the exception of navigation. (We feel that navigation and localization are already well-defined by the community and [REP 105](http://www.ros.org/reps/rep-0105.html).) Object recognition and image classification are two primary targets we are hoping to hit, and we want to cover both 2D and 3D use cases.
The [repository](https://github.com/Kukanani/vision_msgs_proposal) we have created is very much a work in progress, and only with your feedback can we make it better. Any feedback is welcome, but here are a couple of questions I have identified:
1. Are there major use cases or edge cases not covered by this set of messages?
2. Is this set of messages broad enough to encompass both handcrafted and machine learning-based approaches to computer vision?
I suppose this is a larger question of semantics or the ML ecosystem, but I'm of the opinion that classifications derive from detections, such as ROIs, rather than vice versa. In either case, I think the relationship should be made explicit if we are starting to nest standard message types.
When talking about computer vision message standards, one thing comes to mind: features.
It is not unusual for different nodes to exploit the same kind of features (think SIFT/SURF etc.), so rather than extracting the same features several times, a single 'extraction' node does the job and publishes them. They are in turn exploited by (several?) other nodes. A standard message may not be straightforward, and I'm not sure this problem fits the scope of this proposal, but it would certainly be useful.
It's not clear to me whether the proposal supports per-pixel segmentations. There is the `source` field of `Classification2D` and `Classification3D`, which might be used for segmentations, but from the documentation I'm not entirely sure whether it is also meant for that. The name `source` is a bit confusing to me.
There might also be other types of detection besides bounding boxes and segmentation that I'm not currently thinking of, though the two seem like a pretty solid base for now.
# ROS parameter name where the metadata database is stored in XML format.
# The exact information stored in the database is left up to the user.
Why XML? Why not YAML or JSON, or completely implementation defined? Or what about the name of a tree of parameters on the ROS parameter server?
No problem @Loy, it's just that I have a specific use case in mind. It is as follows:
Local features (a point and a descriptor, e.g. SIFT/SURF) are one of the basic components of CV and are used for geometry algorithms as well as appearance-based algorithms. In feature-based visual SLAM (e.g. ORB-SLAM), you rely on features both for pose estimation (geometry) and place recognition (appearance). Those two tasks can be executed in parallel threads. Assuming you are using the same features for both tasks, one thread could communicate a `Features.msg` (or similar) to the other.
It is something (a `Features.msg`) I have been hackily doing here and there, feeding different classifiers (different processes, for that matter).
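For concreteness, here is a rough sketch of the kind of container I mean (plain Python; every name and field here is hypothetical, not a proposed definition):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical layout for a Features.msg equivalent -- nothing here is part
# of the proposal; field names are illustrative only.
@dataclass
class KeyPoint:
    x: float          # pixel column of the feature
    y: float          # pixel row of the feature
    size: float       # diameter of the meaningful neighborhood
    angle: float      # orientation in radians
    response: float   # detector response (strength)

@dataclass
class Features:
    detector: str                 # e.g. "ORB", "SIFT", so consumers can check compatibility
    descriptor_length: int        # elements per descriptor (e.g. 32 for ORB, 128 for SIFT)
    keypoints: List[KeyPoint] = field(default_factory=list)
    # Descriptors flattened row-major: descriptor i occupies
    # descriptor_data[i * descriptor_length : (i + 1) * descriptor_length].
    descriptor_data: List[float] = field(default_factory=list)

    def descriptor(self, i: int) -> List[float]:
        """Return the descriptor for keypoint i."""
        start = i * self.descriptor_length
        return self.descriptor_data[start:start + self.descriptor_length]

# A single extraction node would fill one of these per frame and publish it;
# SLAM and place-recognition consumers would read from the same message.
feats = Features(detector="ORB", descriptor_length=4)
feats.keypoints.append(KeyPoint(x=10.0, y=20.0, size=7.0, angle=0.5, response=0.9))
feats.keypoints.append(KeyPoint(x=30.0, y=40.0, size=7.0, angle=1.0, response=0.8))
feats.descriptor_data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(feats.descriptor(1))  # [5.0, 6.0, 7.0, 8.0]
```

The flat descriptor array with an explicit stride mirrors how `sensor_msgs/Image` stores pixel data, which keeps the message binary-friendly regardless of descriptor type.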
I am just wondering whether a standardized way of moving such objects around would make sense.
PS: To be fair, local features are also extracted from other sensor readings (e.g. laser scans, point clouds), so my question may be a little out of the scope of this thread.
Thanks for the awesome feedback, everyone! I'll try to address everything that was brought up.
First, let me start off by noting that although I only created Classification and Detection messages, I think it makes sense to keep this as a general `vision_msgs` package, and additional computer vision-related messages can be added as time goes on. I think it's more useful than making a too-specific `classification_msgs` or similar.
@reinzor, thanks for linking to your message definitions! I think that annotations are already covered under the existing implementation. You could provide the bounding box coordinates in a `Detection` message, and the most likely label as the only result in the class probabilities. If we want to add other information, such as color of the outline, etc. then maybe this would be a better fit for `visualization_msgs` or another package.
On another note, is human pose estimation standardized enough to make a custom message type for it? Or is it best described by a TF tree, arbitrary set of `geometry_msgs/Pose`, or some other existing ROS construct? I'm thinking of the fact that different human detectors provide different levels of fidelity, so it might be difficult to standardize.
@ruffsl, My idea with having two poses is that the bounding box could actually have a different pose from the expressed object pose. For example, the bounding box center for a coffee mug might have some z-height and be off-center wrt the body of the mug, but the expressed object pose might be centered on the cylindrical portion of the mug and be at the bottom. However, maybe it makes sense to forego the bounding box information, as this could be stored in the object metadata, along with a mesh, etc.
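To make the mug example concrete, here is a tiny sketch (plain Python, made-up numbers) of how the two pose fields could disagree:

```python
# Hypothetical coffee-mug example: both poses are in the same camera/world
# frame; only positions are shown, orientations omitted for brevity.
bbox_center = (0.50, 0.12, 0.07)   # geometric center of the 3D bounding box
                                   # (the handle pulls it off-axis, and it sits
                                   # at half the mug's height)
object_pose = (0.49, 0.10, 0.00)   # canonical frame: bottom-center of the
                                   # cylindrical body, ignoring the handle

# The per-axis offset a consumer would see between the two fields:
offset = tuple(b - o for b, o in zip(bbox_center, object_pose))
print(offset)  # bbox center sits above (z) and slightly off-center (x, y)
```

If the bounding box were dropped from the message, this offset (or a full mesh) would instead live in the object metadata database.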
On the topic of nesting, I'm open to the idea of flattening the hierarchy and having Classification/Detection 2D/3D all include a new `CategoryDistribution` message. I'm not sure how much message nesting is considered standard practice, so I'll look at some other packages to get an idea.
@Jeremie, I like the idea of adding a standardized `VisualFeature.msg` or similar message, as long as there is some common baseline that can cover a lot of feature types. From my own understanding of visual features, there's usually a lot of variation in how the feature is actually defined and represented, so I'm not able to find a "lowest common denominator" from my own experience. If you feel there's something there that could be broadly useful, please feel free to post it here or make a pull request. I agree with @Loy as well: although many classifiers use features internally, this should be hidden in the implementation except in special cases like the SLAM scenario described.
I didn't design the current messages to support per-pixel segmentation, and I'll have to look into how that is usually represented to get a good idea of how to craft a message for it. My initial guess is that it will be a separate message type from `Classification` and `Detection`.
On the topic of the parameter server, I think it's worth having a discussion about representation format. From talks with other OSRF folks, I don't think it's a good idea to use a tree of parameters; a single parameter would be better. For example, if you are loading the ImageNet class names, that's 1000 items on the parameter server, just to store the names. Add object meshes, sizes, etc., and it could balloon very quickly.
While JSON/XML/YAML might work equally well in terms of expressive power, with XML we can be sure that both C++ and Python will be able to read the database. TinyXML is already included as a low-level dependency in ROS C++, but the same can't be said for a YAML or JSON parser. Rather than letting people use whatever's convenient, I think it's worth restricting (or at least recommending) a format that can be parsed from more languages. We could recommend it in the REP but not enforce it, so if someone really wants to use YAML in their Python-only implementation, they could do so. That's my position, but I'm interested in hearing other ideas.
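As a sanity check that the XML route stays lightweight on the Python side, here's a sketch of reading such a database with only the standard library (the schema is made up; the proposal leaves the exact contents to the user, and in a live node the string would come from a single parameter-server lookup rather than being inlined):

```python
import xml.etree.ElementTree as ET

# Stand-in for the value of the single metadata parameter; in a running
# system this would be fetched once, e.g. via the parameter server.
database_xml = """
<classes>
  <class id="0" name="background"/>
  <class id="1" name="mug" mesh="package://my_pkg/meshes/mug.stl"/>
  <class id="2" name="apple"/>
</classes>
"""

root = ET.fromstring(database_xml)
# Build an id -> name lookup for resolving class IDs in incoming messages.
names = {int(c.get("id")): c.get("name") for c in root.findall("class")}
print(names[1])  # mug
```

The C++ side could do the equivalent with TinyXML, which is the portability argument above: one string parameter, parseable in both client libraries without new dependencies.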
In regard to pixel labeling, I've also seen `sensor_msgs/Image` used as a means of publishing, along with some custom structure on a separate topic to define the mapping between pixel values and labels. It would also be cool to have a message type for publishing an array of convex bounding polygon vertices with label IDs. That's a common use case when labeling regions of an image, and it would be a good compressed representation to transmit for classification modalities that use that format.
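A rough sketch of the polygon idea (plain Python; the structure is hypothetical, not an existing message), showing how a few vertices per region replace a full per-pixel mask:

```python
# Each labeled region: (class_id, ordered vertex list). A handful of
# vertices stands in for thousands of mask pixels.
regions = [
    (1, [(10, 10), (50, 10), (50, 40), (10, 40)]),  # class 1: a rectangle
    (2, [(60, 20), (80, 20), (70, 45)]),            # class 2: a triangle
]

def polygon_area(vertices):
    """Shoelace formula: area enclosed by an ordered vertex list."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

# A consumer can recover region geometry (here, the covered area) directly
# from the vertices, without ever rasterizing a mask.
for class_id, verts in regions:
    print(class_id, polygon_area(verts))
```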
This looks like it would be a clean implementation. Just to be sure (since I'm a segmentation newbie), the size of `results` would be the number of pixels in the `mask`? There's a distribution for each pixel?
[quote="AdamAllevato, post:11, topic:1819"]
First, let me start off by noting that although I only created Classification and Detection messages, I think it makes sense to keep this as a general vision_msgs package, and additional computer vision-related messages can be added as time goes on. I think it's more useful than making a too-specific classification_msgs or similar.
[/quote]
I am not an expert in this field, but I would like to clarify whether the classification and detection messages are specific to 2D image processing, or if they can also be used for 3D point cloud processing or even 2D laser scan processing. If there is a possibility that they may be used outside of image processing, then perhaps a `classification_msgs` or similar package actually is appropriate. Just something I think should be considered.
Is it likely that many pixels in the image will have identical distributions? It seems that "apple" pixels near the edge of the apple would have a different probability distribution than those near the center. All the ML-based segmentation systems I've seen either predict a single output class per pixel (such as a binary classifier) or produce a full probability vector for each pixel.
Defining a small set of distributions that the image indexes into, then transmitting that set with every result, seems like a halfway solution. I feel that these two options would cover the use cases:
1. The image is segmented into some small, finite set of output classes whose probability distributions do not vary in space or time: use an Image message where the pixel's lookup value is the output class. If desired, the static probability distribution for each class can be communicated once, such as via a single CategoryDistribution message or the parameter server.
2. The output segmentation includes varying probability distributions calculated per pixel or per small region: use an array of CategoryDistribution with one entry per pixel, where each pixel has its own unique distribution that may change every frame.
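Option 1 could be sketched like this (plain Python; the class names and image contents are made up):

```python
# Class list communicated once (e.g. via the metadata database or a latched
# topic); pixel values in the label image index into it.
class_names = ["background", "mug", "apple"]

# A tiny 4x4 "segmentation" frame; in ROS this would travel as a
# one-channel sensor_msgs/Image (e.g. mono8) whose pixel values are
# the class indices below.
label_image = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]

# A consumer recovers per-class pixel counts without needing a per-pixel
# probability distribution in every frame:
counts = {name: 0 for name in class_names}
for row in label_image:
    for pixel in row:
        counts[class_names[pixel]] += 1
print(counts)  # {'background': 9, 'mug': 4, 'apple': 3}
```

Option 2 would replace the integer pixels with per-pixel CategoryDistribution entries, trading bandwidth for full uncertainty information.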
Let me know if I missed something! If you have some code available for a use case, that's really helpful. I'm currently in the process of writing example classifiers to use the Classification/Detection messages and finding it a useful exercise.
3D point cloud processing generally falls under the topic of "computer vision." But I had not considered laser scan processing, good point. The package name will probably be subject to review from more senior OSRF architects, and we'll keep that in mind!