It is with pleasure that I share the successful completion of this Toyota Code Sprint in this final blog post. In this project, homography estimation based on multi-modal, multi-descriptor correspondence sets was explored, which inspired the introduction of the multi-descriptor voting approach (MDv). The proposed MDv approach achieved consistent accuracy in the 0.0X range, a level of consistency better than that of approaches based on single-type state-of-the-art descriptors, including SIFT. In the process, a framework for analyzing and evaluating single- and multi-descriptor performance was developed and employed to validate the robustness of MDv, as compared with homography estimations based on a single descriptor type, as well as those based on RANSAC registration of best-K multi-descriptor correspondence sets. The code and dataset for this project are hosted at https://github.com/mult-desc/md, with dependencies on both PCL 1.7 and OpenCV 2.4.6.
What follows is an in-depth report detailing the project's accomplishments, as well as design and validation considerations. Click here for a high-resolution version of the report.
In the previous blog post I described my attempts to find a good balance between the contributions of the three terms (distance, normal, and color) to edge weight computation. As it often happens, as soon as I was done with the evaluation and the blog post, I realized that there is another type of information that could be considered: curvature. And indeed, it proved to have a very positive effect on the performance of the random walker segmentation.
Let me begin by exposing a problem associated with edge weights computed using the normal term alone (i.e. depending only on the angular distance between the normals of the vertices). Consider the following scene (left):
Voxelized point cloud (left) and a close-up view of the graph edges in the region where the tall and round boxes touch (right). The edges are colored according to their weights (from dark blue for small weights to dark red for large weights).
Most of the edges in the boundary region (right) are dark blue; however, there are a number of red edges with quite large weights. This sort of boundary is often referred to as a "weak boundary" and, not surprisingly, has a negative effect on the performance of many segmentation algorithms. You can imagine that a boundary like this is a disaster for flood-fill segmentation, because the flood will happily propagate through it. Luckily, the random walker algorithm is known for its robustness against weak boundaries:
Segmentations produced by the random walker algorithm using three different choices of seeds (shown with colored markers).
In the first two cases one of the seeds is very close to the weak boundary, whereas another one is far away. In the third case there are multiple green seeds placed along the boundary, however a single purple seed is able to “resist” them from its remote corner.
This robustness has limits, of course. In the figure below the “table seed” is placed in the rear of the table, far from the boundaries with the boxes. The box segments managed to “spill” on the table through the weak boundaries:
Segmentation failure when a seed is placed too far from a weak boundary.
One way to address this problem is to make the sigma of the normal term smaller, thereby penalizing differences in normal orientations more. Enabling the distance term might also help, because the edges that contribute to the boundary weakness are often diagonal and therefore longer than average. The figure below (left) demonstrates the graph edges in the same boundary region with new weights, computed using a decreased normal term sigma (10% of the original one) and with the distance term enabled. (The overall edge color shift towards blue is due to the latter.)
Graph edges with weights computed with a smaller normal term sigma and the distance term enabled. Close-up view of the region where the tall and round boxes touch (left) and top-down view of the rear of the table (right).
Still, there are several edges with relatively large weights in the boundary region, but there are no large-weight paths connecting vertices on both sides of the boundary anymore. We got rid of the weak boundary, but this came at a price. Although the table itself is flat, the cloud that we get from the Kinect is not, and the further from the camera the more wavy it is. The image on the right shows a top-down view at the rear of the table, where the waves are particularly large. The edges that belong to the cavities between the “waves” were heavily penalized and virtually disappeared. Random walkers will have a hard time traversing this part of the table on their way to the boxes!
Having considered all of this, I came to the conclusion that some additional geometric feature is needed to improve the weighting function. Curvature was the first candidate, especially in light of the fact that we get it for free anyway when estimating voxel normals (via PCA of the covariance matrix of the voxel neighborhood). I added one more exponential term to the weighting function, where the curvature difference is simply the product of the voxel curvatures. Similarly to how it is done in the normal term, the product is additionally multiplied by a small constant if the angle between the voxels is convex (in order not to penalize convex boundaries).
The figure below demonstrates the edge weights computed using the new term alone:
Graph edges with weights computed using only the new curvature term. Close-up view of the region where the tall and round boxes touch (left) and top-down view of the rear of the table (right).
The boundary between the boxes is perfectly strong, whereas the weights in the rear of the table are not penalized too much. The seeding that resulted in a segmentation failure before no longer causes problems. I used the set of random seedings described in the last post to find the best sigmas for the extended weighting function. Then I generated 50 new seedings to test and compare the performance of the old (without the curvature term) and the new (with the curvature term) weighting functions. The figure below summarizes the distributions of under-segmentation errors:
The performance significantly improved in terms of stability, quality, and number of failures (in fact, there are no failures at all).
The random walker segmentation algorithm requires that the data are modeled as a weighted graph, and the choice of edge weighting function has a great impact on the performance of the algorithm. In this blog post I will describe the weighting function and parameters that I ended up using.
Before talking about the weights of the edges between vertices, let’s discuss the vertices themselves. As mentioned in the previous blog posts, I have a pre-processing step where the input cloud is voxelized using OctreePointCloudAdjacency. Voxelization serves three purposes:
The voxels consequently become vertices of the graph, and each of them is connected with its neighbors by an edge. Each voxel has several properties: 3D position, normal orientation, and color, which may be used in edge weight computation.
As mentioned in the very first blog post on random walker segmentation, originally I used the edge weighting function from the following paper:
Later on I introduced several modifications. Now, for a pair of vertices v_i and v_j, the weight w_ij is defined as a product of three exponential terms:

w_ij = exp(−d/σ_d) · exp(−θ/σ_θ) · exp(−c/σ_c)

where d, θ, and c are the Euclidean, angular, and color differences between the voxels, and the sigmas are used to balance their contributions. Compared to the weighting function of Lai et al., the color term was added, and scaling by mean values was removed.
I devised the following procedure in order to find appropriate values for the sigmas. I took a scene with known ground truth segmentation and generated 50 random proper seedings. Here by a "proper seeding" I mean a set of seeds where each seed belongs to a distinct ground truth segment, and each ground truth segment has a single seed (see example in the figure below on the left). For each of these seedings I ran random walker and computed the under-segmentation error (defined in one of the earlier blog posts; see example in the figure below on the right). Then I analyzed the distributions of errors resulting from different sigma values.
Voxelized point cloud with one of the randomly generated proper seedings used in the experiments (left). Under-segmentation error of the segmentation produced by random walker from the given seeds (right; erroneous voxels are highlighted).
Note that the ground truth itself is not perfect, because it is often impossible to tell apart the points at the boundary of two objects. Consequently, the ground truth segmentation is somewhat random at the boundaries, and we should not expect (or strive for) any segmentation algorithm to produce exactly the same result. The under-segmentation error displayed above has 1039 erroneous voxels, and this is pretty much the best performance we could expect from a segmentation algorithm with this ground truth.
Let’s begin by examining the influence of the angular term. In this experiment I set Euclidean sigma to the value of voxel resolution and color sigma to 0.1. Below is a plot of under-segmentation error distributions for different choices of angular sigma (note that larger sigmas correspond to less influence):
Each distribution is visualized using a boxplot. The three main features are the position of the red bar (median of the distribution), size of the box (50% of the values fall inside the box), and the amount and positions of pluses (outliers). The first one gives an idea of the average performance of the algorithm. The second one expresses the segmentation stability with respect to the seed choice (with smaller box meaning better stability). The third one indicates segmentation failures. Indeed, a significant deviation of the under-segmentation error means that the output segmentation has large mis-labeled regions, which may be deemed as a failure.
Clearly, the median values of the distributions are almost the same. The differences are very small and, due to the discussed properties of the under-segmentation error, cannot be used to draw conclusions about which sigmas are better. The box sizes, however, vary significantly. The sigmas from 0.1 to 1.1 yield the most stable performance. The number of failures is also smaller for those sigmas. This evaluation does not provide enough information to choose any particular sigma in this range, so for now I settled on 0.2 (it is the second most stable, but yields fewer failures than 0.1).
In order to explore the influence of the color term, I set Euclidean sigma to the value of voxel resolution again and angular sigma to 0.2:
Note the last column, which shows the error distribution when the color term is removed completely. We see that sigmas from 0.1 to 0.15 provide slightly more stable results. Unfortunately it is not visible in the plot, but the number of outlier pluses is also smaller for these sigmas. So I chose 0.15 as a result of this evaluation.
As for the distance sigma, it turned out to have very little influence on the results. In most cases introducing the distance term does not change the output at all. Still, in a few cases it helps to avoid complete segmentation failure. Setting this sigma to the voxel resolution value turned out to give the best results.
Finally, let me demonstrate the segmentations produced with the chosen sigmas. Among the 50 random seedings only 5 resulted in segmentation failure:
5 failed segmentations
Clearly, failures happened when a seed was placed either exactly on the boundary between two objects (#2), or on the outermost voxels of an object (#1, #3, #4, #5).
The remaining 45 seedings yielded good segmentations. Below are 15 of them (selected randomly):
15 successful segmentations
In the past weeks I decided to put the “spectral thing” on hold and turned back to the random walker segmentation. In this blog post I will talk about refactoring of the random walker algorithm and my experiments with different linear solvers. In the follow-up post I will explore how the segmentation results depend on seed placement, as well as discuss edge weighting functions and the choice of parameters for them.
First of all, I refactored and cleaned up my implementation. Recall that the random walker algorithm could be applied to cluster any kind of data as long as there is a way to model it using a weighted graph. I decided that it makes sense to have a generic templated implementation of the algorithm which would work with any weighted graph. An obvious choice to represent graphs in C++ is to use the primitives available in the Boost Graph Library (BGL). This library is very generic, feature-rich, and flexible, though at the expense of a rather steep learning curve. I took their implementation of the Boykov-Kolmogorov max-flow algorithm as an example of how to design the interface for a generic graph-based algorithm. In my case the public interface is just one templated function:
template<class Graph, class EdgeWeightMap, class VertexColorMap> bool randomWalkerSegmentation(Graph& g, EdgeWeightMap weights, VertexColorMap colors);
The user has to provide a graph, a property map that associates a weight to each edge of the graph, and a property map that contains initial vertex colors. (I adopted the term “colors” instead of “labels” because BGL has a pre-defined vertex property type with this name.) The output of the algorithm (i.e. label assignment) is written back to the color map. Internally the function instantiates a class, which does all the boring work of constructing and solving a system of linear equations, as well as interpreting its solution as a label assignment.
While this generic graph segmentation function might be useful for someone, the general audience will be more interested in a class that implements a complete point cloud segmentation pipeline. This class should take care of converting an input cloud into a weighted graph, segmenting it, and turning the random walker output into a labeled point cloud. At the moment it is not clear to me how the first step should be designed. Indeed, there are multiple ways to represent a point cloud as a weighted graph, both in terms of topology and edge weights. Currently I voxelize the input cloud and use the 26-neighborhood to establish edges between nodes. Alternatively, one may work with the full point cloud and use some fixed-radius neighborhood. One more option might be to generate a mesh and work with it. Exploration of these possibilities remains as future work.
The second issue that I addressed recently was the performance of the algorithm. The main computational effort is spent on solving a sparse system of linear equations, where the number of equations is determined by the number of unlabeled vertices (i.e. basically the size of the whole point cloud). For example, the typical size of voxelized scenes from the OSD dataset that I use in my experiments is about 30000 vertices. Originally, I used the ConjugateGradient solver of Eigen, because it is "recommended for large symmetric problems". The time needed to segment a typical point cloud with this solver is about 1 second on my three-year-old i5 laptop. I decided to try other options available in Eigen. In particular, I tested BiCGSTAB with Diagonal and IncompleteLUT preconditioners, as well as the SimplicialLLT, SimplicialLDLT, and SimplicialCholesky solvers. The figure below shows the runtime of these solvers with respect to the problem size. (Only one of the Simplicial*** solvers is plotted, as they demonstrated very similar performance.)
The computation time depends linearly on the problem size for all solvers; however, SimplicialLDLT has a much smaller growth rate. For a typical 30-thousand-vertex problem it needs about 200 ms. What's more, it can solve for multiple right-hand sides at the same time, whereas ConjugateGradient and BiCGSTAB cannot. This means that as the number of labels (i.e. desired segments) grows, the computation time does not increase.
In fact, Eigen offers some more options, such as CholmodSupernodalLLT, which is a wrapper for the SuiteSparse package, and SparseLU, which uses techniques from the SuperLU package. Unfortunately, the former complained that the matrices I provide are not positive definite (though they actually are), and the latter is a very recent addition that is only available in Eigen 3.2 (which I do not have at the moment).
Taking into account the evaluation results, I switched to the SimplicialLDLT solver in my random walker implementation.
I am pleased to announce that I have just finished the second Toyota Code Sprint. The topic we tackled this time was superquadrics applied to computer vision tasks such as object modeling and object detection.
The code we developed has not yet been integrated into PCL (though it does use PCL for all the processing); it lives in a separate repository, which you can find here: https://github.com/aichim/superquadrics .
An extensive report that presents some of the theory behind the concepts and algorithms we used in the project, as well as implementation details and results can be found at the end of this post.
At the end of this work, we present a set of ideas that can be used to extend the project in subsequent code sprints:
In the several previous posts I tried to provide some insight into how the spectral clustering technique may be applied to the point cloud processing domain. In particular, I demonstrated different visualizations of eigenvectors and also did a manual analysis of one particular scene. In this blog post I will (finally) describe my algorithm that does automatic analysis of the eigenvectors, which leads to unsupervised supervoxel clustering.
The input of the clustering algorithm is a set of the first K eigenvectors of the graph Laplacian. Each eigenvector has N elements, which correspond to the supervoxels in the original problem space. The task of the algorithm is to determine the number of clusters that the data points form in the subspace spanned by the eigenvectors and, of course, assign points to these clusters.
The key insight drawn from previous examinations and discussions of the eigenvectors is that the clusters are linearly separable in one-dimensional subspaces spanned by the eigenvectors. In other words, for every pair of clusters there exists at least one eigenvector such that in its subspace these clusters are linearly separable. Based on this premise I built an algorithm which is a pretty straightforward instance of the divisive hierarchical clustering approach. It starts with a single cluster that contains all the data points and recursively splits it in a greedy manner. The following pseudo-code summarizes the algorithm:
The interesting parts are, of course, the FindBestSplit and SplitQuality functions. But to get this straight, let me first define what a "split" is. A split is a pair (e, t), where e is the eigenvector along whose subspace the split occurs, and t is a threshold value. The points whose corresponding elements in the eigenvector are less than t go to the first cluster, and the remaining ones go to the second. For example, the figure below shows with a red line one split on the left and another on the right:
Example splits in the subspaces spanned by the eigenvectors
Which of these two splits is better? Intuitively, the one on the left is more promising than the one on the right. But how to define the split quality? My initial approach was to use the difference between the points immediately above and below the splitting line as the measure. This worked to some extent, but was not good enough. Then I switched to a measure based on the relative densities of the bands above and below the splitting line. Consider the figure below:
Example splits with three bands highlighted. The "split" band is shown in red, the "top" and "bottom" bands are shown in yellow.
The "split" band is the region of low density around the splitting line. The "top" and "bottom" bands are the high-density regions immediately above and below the "split" band. I will omit the details of how these bands are computed, because it is likely that I will modify the implementation in the future. The quality of the split is defined in terms of the band densities: the denser the "top" and "bottom" bands are relative to the "split" band, the higher the quality.
With this quality measure at hand, the FindBestSplit function simply iterates over all available one-dimensional subspaces (i.e. over all eigenvectors) and finds the split with the highest quality.
The performance of the algorithm is excellent on simple scenes:
Spectral supervoxel clustering of simple table-top scenes
And is rather good (though definitely not perfect) on more cluttered ones:
Spectral supervoxel clustering of cluttered table-top scenes
For example, the green book in the first scene is split into two clusters. In the second scene two small boxes in the center of the image are erroneously merged into one cluster. I think these issues are closely related to the split quality measure and the threshold associated with it. There are definitely ways to improve these, and I plan to work on it in the future.
Before exposing the clustering algorithm as promised in the last blog post, I decided to motivate it by showing how the eigenvectors may be analyzed "by hand". Hopefully, this will also provide more intuition about what these eigenvectors actually are and how they are related to the data.
Just to remind you, the problem I am trying to solve is about segmenting a set of supervoxels into meaningful components. Here a meaningful component means a subset of supervoxels that are close to each other in the Euclidean sense, and are separated from the rest by a sharp change in orientation. If we view each supervoxel as a point, then the problem is about clustering points in a 6-dimensional space. (Currently the dimensionality is 6, since supervoxels have 3 Euclidean coordinates plus 3 coordinates of the normal vector; however, additional dimensions, e.g. color, may be added later.) The difficulty of the problem comes from the fact that the components may have arbitrary irregular shapes in these dimensions. Therefore I want to map these points to some other space where the components will correspond to tight clusters, perhaps even linearly separable ones. The current idea is to use a subspace spanned by the first few eigenvectors of the graph Laplacian of the original data.
In the last blog post I provided a visualization of the eigenvectors in the original problem space. For convenience, here are the first four eigenvectors again:
The first 4 eigenvectors in the original problem space
I want to perform clustering in a subspace though, so it is helpful to develop an intuition about what the data look like in it. The figure below demonstrates the data points (that is, supervoxels), projected on each of the first four eigenvectors. (Here and in what follows the data is whitened, i.e. de-meaned and scaled to have unit variance. Additionally, the values are sorted in increasing order.)
Data points in subspaces spanned by the first 4 eigenvectors
Obviously, in each of these subspaces (except for the second) the data is linearly separable in two clusters. What is not obvious, however, is how many clusters there will be in the combined subspace. The next figure shows data points in subspaces spanned by two different pairs of eigenvectors:
Data points in subspaces spanned by the first and third (left) and the fourth and third eigenvectors (right)
Now it becomes evident that there are at least three clusters. Will there be more if we consider the subspace spanned by all three of these eigenvectors? It turns out there will; see the point cloud below:
Unfortunately, we have just reached the limit in terms of how many dimensions can be conveniently visualized. For this particular data set it is enough, though: no more clusters appear if we consider additional eigenvectors. Summarizing, in the subspace spanned by the first four eigenvectors the data points form four tight and (linearly) well-separated clusters. And these clusters actually correspond to the three boxes and the table in the original problem space.
Now the only thing left is to develop an algorithm which would do this kind of analysis automatically!
Correspondence rejection classes implement methods that help eliminate correspondences based on specific criteria such as distance, median distance, normal similarity measure, or RANSAC, to name a few. A couple of additional filters I've experimented with include a uniqueness measure and Lowe's ratio measure, as in "Distinctive image features from scale invariant keypoints", D.G. Lowe, 2004. I've also explored the tradeoffs of implementing the filters within CorrespondenceEstimation itself versus as external CorrespondenceRejection classes. The former is computationally more efficient if the rejection process is done in one pass, while the latter allows for scene-specific sequential filter banks.
What follows is a quick reference guide to the available correspondence rejection classes, with remarks extracted from the source code.
I concluded the last blog post by noting that it seems to be possible to segment objects based on analysis of the eigenvectors of Laplacian constructed from the point cloud. This time I will provide a visual interpretation of eigenvectors and then describe the problem of their analysis.
Let me start with a quick note on eigenvector computation. As mentioned before, they are obtained through eigendecomposition of the Laplacian that represents the surface. In the beginning I used the SelfAdjointEigenSolver of the Eigen library. It runs in O(n³) time (where n is the number of points), which obviously does not scale well. Later I switched to SLEPc. It can limit computation to only a desired number of first eigenpairs and therefore does the job much faster, but still seems to take polynomial time. Therefore I decided to execute supervoxel clustering as a pre-processing step and then compute the distances over the supervoxel adjacency graph, which has a dramatically smaller size than the original point cloud.
Now let’s turn to the eigenvalues themselves. The figure below demonstrates the voxelized point cloud of a simple scene (left) and supervoxel adjacency graph (right), where the adjacency edges are colored according to their weights (from dark blue for small weights to dark red for large weights):
Voxelized point cloud (voxel size 0.006 m) (left). Supervoxel adjacency graph (seed size 0.025 m), colored according to edge weight (right).
Eigendecomposition of the Laplacian of this weighted adjacency graph yields a set of eigenpairs (λ_i, e_i). Each eigenvector has as many elements as there are vertices in the graph. Therefore it is possible to visualize an eigenvector by painting each supervoxel according to its corresponding element in the vector. The figure below shows the first 9 eigenvectors, which correspond to the smallest eigenvalues:
First 9 eigenvectors of the graph Laplacian
The first eigenvector clearly separates the scene into two parts: the third box and everything else. In the second eigenvector the table is covered with gradient, but the first and third boxes have (different) uniform colors and, therefore, stand out. The third eigenvector highlights the second box, and so on.
I have examined quite a number of eigenvectors of different scenes, and I think the following common pattern exists. The first few eigenvectors tend to break the scene into several regions with uniform colors and sharp edges. In the subsequent eigenvectors gradients begin to emerge. Typically, a large part of the scene is covered with a gradient, while a smaller part (corresponding to a distinct component of the graph) has some uniform color.
The overall goal is to figure out the number of distinct components of the graph (that is, objects in the scene) and segment them out. As I argued before, it is clear that the eigenvectors capture all the information needed to do this, so the question is how to extract it. In fact, this problem has already received a lot of attention from researchers under the name of "Spectral Clustering". (Yeah, I made quite a detour through all these distance measures on 3D surfaces before I came to know it.) The standard approach is described in the following paper:
In a nutshell, the original problem space typically has many dimensions (in our case the dimensions are Euclidean coordinates, normal orientations, point colors, etc.). The clusters may have arbitrary irregular shapes, so they can be separated neither linearly nor with hyper-spheres, which renders standard techniques like K-means inapplicable. The good news is that in the subspace spanned by the first few eigenvectors of the Laplacian of the graph (constructed from the original data), the data points form tight clusters, and thus K-means can be used. This effect is evident in the eigenvectors that I demonstrated earlier. Unfortunately, the number of clusters still needs to be known. There exists literature that addresses automatic selection of the number of clusters, however I have not seen any simple and reliable method so far.
In the next blog post I will describe a simple algorithm that I have developed to analyze the eigenvectors and demonstrate the results.
With my current work on optimizing correspondence estimation across the uv/xyz domains, it is worth providing a topology of the available correspondence estimation classes in PCL. For a high-level treatment of the registration API, please refer to the registration tutorial.
Correspondence estimation attempts to match keypoints in a source cloud to keypoints in a target cloud, based on some similarity measure, feature descriptors in our case. Although applying scene-relevant descriptor parameters and correspondence thresholds may reduce erroneous matches, outliers persist and impact pose estimation. This is due to the implied assumption that for each source keypoint a corresponding target keypoint exists. The difficulty of estimating model- or scene-specific descriptor parameters is another factor.
What follows is a quick reference guide to the available correspondence estimation classes, with remarks extracted from the source code.
Last time I wrote about distance measures on 3D surfaces, though I did not give any details about how they are computed. In this blog post I will give a formal definition, followed by two important properties that simplify the computation and provide insights that might help to solve the ultimate goal: identification of distinct objects in a scene.
Given a mesh (or a point cloud) that represents a surface, a discretization of the Laplace-Beltrami operator (LBO) is constructed. This discretization is a sparse symmetric n × n matrix, where n is the number of vertices (points) in the surface. The non-zero entries of this matrix are the negated weights of the edges between adjacent vertices (points), as well as the vertex degrees. This matrix is often referred to as the Laplacian. Eigendecomposition of the Laplacian yields pairs (λ_i, e_i), where λ_i are the eigenvalues and e_i are the corresponding eigenvectors.
The diffusion distance is defined in terms of the eigenvalues and eigenvectors of the Laplacian as follows:

d_t(x, y)² = Σ_i exp(−2 λ_i t) (e_i(x) − e_i(y))²

The biharmonic distance bears a strong resemblance to it:

d_B(x, y)² = Σ_i (e_i(x) − e_i(y))² / λ_i²

Here e_i(x) means the x-th element of eigenvector e_i. Both distances have a similar structure: a sum over all eigenpairs, where the summands are differences between corresponding elements of the eigenvectors, scaled by some function of the eigenvalues. There are two properties of these distances that I would like to stress.
Firstly, the summands form a decreasing sequence. The figure below illustrates this point with the eigenvalues of the Laplacian of a typical point cloud:
In the left image the first hundred eigenvalues (except the first one, which is always zero) are plotted. Note that the values are normalized by (i.e. divided by) the largest of them. The magnitudes of the eigenvalues increase rapidly. On the right, the multipliers of both the diffusion and biharmonic distances are plotted (also computed with normalized eigenvalues). The biharmonic distance multiplier is plotted for several choices of its parameter. Clearly, only the first few terms in the summation are needed to approximate either of the distances. This has an important consequence: there is no need to solve the eigenproblem completely; rather, it suffices to find a limited number of eigenpairs with small eigenvalues.
Secondly, the distance between two points x and y depends on the differences φ_i(x) − φ_i(y) between their corresponding elements in the eigenvectors. The figure below demonstrates the (sorted) elements of a typical eigenvector:
One may see that there are groups of elements with nearly the same value. For example, there is a group of about one hundred elements that share almost the same value. The pair-wise distances between the points that correspond to these elements will therefore be close to zero. In other words, plateaus in eigenvector graphs correspond to sets of mutually close points, and such sets may be interpreted as objects.
Summing up, it seems that it should be possible to identify distinct objects in a point cloud by analyzing the eigenvectors of the Laplacian (even without explicitly computing any of the distance measures). Moreover, only the first few eigenvectors are relevant, so it is not necessary to solve the eigenproblem entirely.
In recent weeks I have been developing an approach that would allow automatic selection of seed points. I decided to proceed with methods that use a graph-based representation of point clouds and, as a matter of fact, are closely related to random walks on those graphs. I still do not have any solid results, but there are some interesting outputs that I would like to share in this blog post.
In the domain of mesh segmentation, or more generally 3D shape analysis, there is a fundamental problem of measuring distances between points on a surface. The most trivial and intuitive one is the geodesic distance, which encodes the length of the shortest path along the surface between two points. It has a number of drawbacks, the most important being its sensitivity to perturbations of the surface. For example, introducing a hole along the shortest path between two points, or a small topological shortcut between them, may induce an arbitrarily large change in the distance. This is an undesirable property, especially considering the noisy Kinect data that we work with.
A more sophisticated distance measure, the diffusion distance, is based on the mathematical study of heat conduction and diffusion. Suppose we have a graph that represents a shape (it may be constructed exactly the same way as for the Random Walker segmentation). Imagine that a unit amount of heat is applied at some vertex. The heat will flow across the edges, and the speed of its diffusion will depend on the edge weights. After time t has passed, the initial unit of heat will be somehow distributed among the other vertices. The heat kernel k_t encodes this distribution. More specifically, k_t(x, y) is the amount of heat accumulated after time t at vertex y if the heat was applied at vertex x. Based on this kernel the diffusion distance between each pair of points is defined. Importantly, the distance depends on the time parameter t and, depending on its value, captures either local or global shape properties.
Another distance measure, commute-time distance, is the average time it takes a random walker to go from one vertex to the other and come back. Finally, biharmonic distance was proposed most recently in:
This distance measure is non-parametric (does not depend on e.g. time) and is claimed to capture both local and global properties of the shape.
The figure below demonstrates biharmonic, commute-time, and heat diffusion distance maps computed with respect to the point marked with a red circle:
Left column: voxelized point cloud (top), biharmonic distance (middle), commute-time distance (bottom). Right column: heat diffusion distance for several choices of the time parameter.
In each image the points with the smallest distance are painted dark blue, and the points with the largest distance are dark red. The absolute values of the distances are very different in each case.
I think these distance maps could be used to infer the number of distinct objects in the scene. Indeed, points that belong to the same object tend to be equidistant from the source point, so different objects correspond to different blobs of homogeneous points. Finding objects thus amounts to finding the modes of the distribution of distances, which could be accomplished with the Mean-Shift algorithm.
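Since the distances form a one-dimensional sample, the mode finding can be sketched in a few lines of flat-kernel mean shift (a toy Python illustration with made-up distance values, not the implementation used in the project):

```python
def mean_shift_modes(values, bandwidth, iters=50, tol=1e-6):
    """Find modes of a 1D sample with a flat-kernel mean shift."""
    modes = []
    for v in values:
        x = v
        for _ in range(iters):
            # shift the current estimate to the mean of its window
            window = [u for u in values if abs(u - x) <= bandwidth]
            nx = sum(window) / len(window)
            if abs(nx - x) < tol:
                break
            x = nx
        # keep the mode only if it is not a duplicate of an earlier one
        if all(abs(x - m) > bandwidth for m in modes):
            modes.append(x)
    return sorted(modes)

# Two "objects": points roughly equidistant at distance ~1 and ~5 from the source.
dists = [0.9, 1.0, 1.1, 1.05, 4.9, 5.0, 5.1]
print(mean_shift_modes(dists, bandwidth=0.5))  # two modes, one per object
```

The number of modes found would then be the inferred number of objects.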
As for the particular choice of distance measure, the biharmonic distance and the heat diffusion distance with a large time parameter intuitively seem better than the others; however, this is a subject for more careful examination.
This is a follow-up post to the segmentation using random walks. It turned out that my initial implementation had several bugs which significantly worsened the performance. In this blog post I will describe them and show the outputs produced by the fixed algorithm.
The segmentation results that I demonstrated last time exhibit two undesirable features: vast regions with random label assignment and “zones of influence” around seed points. My first intuitive explanation was that since the edge weights are always less than 1, a random walker cannot get arbitrarily far from its starting point; therefore, if a seed is reasonably far away, the probability of getting there is very small. Surprisingly, I did not find any mentions or discussions of this effect in the literature. Moreover, while carefully re-reading “Random Walks for Image Segmentation” I noticed that for each point and label pair the algorithm outputs not the probability that a random walker started from that point will reach the label, but rather the probability that it will first reach that label (i.e. earlier than the other labels).
I decided to visualize edge weights and the probabilities I get with my implementation. In the following experiment I labeled three points, one on the table, and the other two on the boxes (see the left figure):
Color point cloud with three labeled points (left) and the probabilities that a random walker will first reach the top right label (right)
The figure in the middle shows all edges along which random walkers move. Each edge is visualized by its middle point colored according to edge weight (using “jet” color map where 0 is dark blue and 1 is dark red). The weights do make sense, as the edges in the planar regions are mostly reddish (i.e. weight is close to 1), whereas on the surface boundaries they are bluish (i.e. weight is close to 0). Moreover, it is clear that concave and convex boundaries have different average weights.
The figure on the right shows all points colored according to the probability that a random walker started from that point will first reach the labeled point on the box in the back. The probability images for the other labels are similar; thus, for the majority of points the probability of first reaching either label is close to zero. Clearly, this cannot be right, because some label has to be reached first anyway, and therefore the probabilities at each point should sum up to one. This observation triggered a long search for bugs in my implementation, and in the end I discovered and fixed two issues.
As I mentioned before, the solution for random walker segmentation is obtained by solving a system of linear equations, where the coefficient matrix consists of edge weights and their sums, arranged in a certain order. The system is huge (there are as many equations as there are unlabeled points), but sparse (the number of non-zero coefficients in a row depends on the number of edges incident to a point). The first issue was due to a bug (or a feature?) of the OctreePointCloudAdjacency class that I used to determine the neighborhoods of points. In a nutshell, it is a specialized octree where leaves store pointers to their direct neighbors (in terms of 26-connectivity). For some reason, a leaf always stores a pointer to itself in the list of neighbors. This caused bogus self-edges in my graph, which did not show up in the matrix themselves (because the diagonal elements are occupied by the sums of the weights of incident edges), but treacherously contributed to those sums, thus invalidating them.
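To illustrate why the self-edges were harmful, here is a toy Python sketch of Laplacian assembly (the real implementation uses a sparse Eigen matrix): the sketch skips self-edges, and the comment marks exactly where an unskipped self-edge would corrupt the diagonal sums.

```python
def laplacian(n, edges):
    """Dense graph Laplacian: L[i][i] = sum of weights of incident edges,
    L[i][j] = -w for an edge (i, j, w).  Rows of a valid Laplacian sum to zero."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        if i == j:
            # A self-edge must be skipped: it would add w to the diagonal
            # sum without any off-diagonal counterpart, breaking the matrix.
            continue
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L

edges = [(0, 1, 0.5), (1, 2, 0.25), (1, 1, 1.0)]  # the last one is a bogus self-edge
L = laplacian(3, edges)
print([sum(row) for row in L])  # all zeros; with the self-edge counted, row 1 would sum to 1.0
```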
The second issue was more subtle. For the system to have a unique solution, its matrix has to be non-singular. In terms of the graph this means that it either has to be connected, or has to contain at least one seed in every connected component. From the beginning I had a check to enforce this requirement; however, I did not take into account that the weighting function may assign zero weight to an edge! In the pathological case all the edges incident to a point may happen to be “weightless”, resulting in an all-zero row in the system.
Below are the results obtained using the random walker algorithm after I fixed the described issues:
Random walk segmentation with overlaid seed points (left); probabilities that a random walker will first reach each of the labels (right)
As before, I labeled a single point in each object. (Actually, a single point in each disjoint part of each object; that is why there are multiple labeled points on the table.) The scene is perfectly segmented. On the right is an animated GIF with a sequence of frames showing the probabilities (potentials) for each label. In most cases the separation between object and background is very sharp, but in two cases (the small box on top of the book and the standing box on the right) some non-zero probabilities “spill” outside the object. Nevertheless, this does not affect the final segmentation, since the potentials for the correct labels are higher.
I am very happy with the results; however, one should remember that they were obtained using manual selection of labels. This code sprint is aimed at automatic segmentation, so next I plan to consider different strategies for turning this into a fully autonomous method.
In the last few weeks I have been working on two things in parallel. On one hand, I continued to improve supervoxel segmentation, and a description of this is due in a later blog post. On the other hand, I started to look into an alternative approach to point cloud segmentation which uses random walkers. In this blog post I will discuss this approach and show my initial results.
The random walker algorithm was proposed for interactive image segmentation in:
The input is an image where several pixels are marked (by the user) with different labels (one for each object to be segmented out) and the output is a label assignment for all the remaining pixels. The idea behind the algorithm is rather intuitive. Think of a graph where the nodes represent image pixels. The neighboring pixels are connected with edges. Each edge is assigned a weight which reflects the degree of similarity between pixels. As it was mentioned before, some of the nodes are marked with labels. Now take an unlabeled node and imagine that a random walker is released from it. The walker randomly hops to one of the adjacent nodes with probability proportional to the edge weight. In the process of this random walk it will occasionally visit labeled nodes. The one that is most likely to be visited first determines the label of the node where the walker started.
Luckily, one does not have to simulate random walks from each node to obtain the probabilities of arriving at labeled nodes. The probabilities may be calculated analytically by solving a system of linear equations. The matrix of coefficients consists of edge weights and their sums, arranged in a certain order. The author did a great job explaining why this is so and how exactly the system should be constructed, so those who are interested are referred to the original paper.
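A toy Python sketch of this computation (using naive Gauss-Seidel iteration on a four-node path graph instead of a sparse direct solver) makes the key property visible: at every node, the first-arrival probabilities over all labels sum to one.

```python
def first_arrival_probs(n, edges, seeds, iters=2000):
    """For each label, solve the random walker system by Gauss-Seidel iteration:
    the potential of an unlabeled node is the weighted average of its
    neighbours' potentials; seeded nodes are clamped to 0 or 1."""
    nbrs = {i: [] for i in range(n)}
    for i, j, w in edges:
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    labels = sorted(set(seeds.values()))
    probs = {lab: [1.0 if seeds.get(i) == lab else 0.0 for i in range(n)]
             for lab in labels}
    for lab in labels:
        x = probs[lab]
        for _ in range(iters):
            for i in range(n):
                if i in seeds:
                    continue  # seeds keep their boundary values
                x[i] = (sum(w * x[j] for j, w in nbrs[i])
                        / sum(w for _, w in nbrs[i]))
    return probs

# Path graph 0-1-2-3 with unit weights; node 0 seeded "A", node 3 seeded "B".
probs = first_arrival_probs(4, [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)],
                            {0: "A", 3: "B"})
print(probs["A"], probs["B"])  # at every node the two probabilities sum to one
```

On this chain the potentials are simply linear interpolations between the seeds, which matches the intuition of a walker drifting toward the nearer label.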
Originally, the algorithm was applied to segment 2D images, however it could be used for any other data as long as there is a way to model it using a weighted graph. For us, of course, 3D point clouds are of a particular interest. The following paper describes how random walker segmentation could be applied for meshes or point clouds:
The authors construct the graph from a point cloud in the following way. Each point in the cloud becomes a node in the graph, and a normal vector is computed for it.
For a pair of nodes v_i and v_j two distances are defined: the Euclidean distance d_E(v_i, v_j) = ‖v_i − v_j‖ and the angular distance d_A(v_i, v_j) = η · (1 − cos θ_ij), where θ_ij is the angle between the normals at v_i and v_j.
In the angular distance η is a coefficient which depends on the relative concavity between the points. For the convex case it is set to a small value, effectively discounting the difference, whereas for the concave case it is set to 1.
Using a K-nearest-neighbors search, the neighborhood of a node is established. An edge is created between the node and each other node in the neighborhood. The weight of the edge depends on the two distances and is defined as:

w_{ij} = \exp\left( -\beta \frac{d_A(v_i, v_j)}{\bar{d}_A} - \gamma \frac{d_E(v_i, v_j)}{\bar{d}_E} \right)
β and γ are used to balance the contributions of the two distances. d̄_A and d̄_E in the denominators stand for the mean values of the distances over the whole point cloud.
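A minimal sketch of this weight computation (the exponential-of-normalized-distances form follows my reading of the paper, and the default values of beta and gamma are assumptions, not the ones used in the project):

```python
import math

def edge_weight(d_angular, d_euclidean, mean_angular, mean_euclidean,
                beta=1.0, gamma=1.0):
    """Edge weight in (0, 1]: close to 1 for similar, nearby points and
    close to 0 for dissimilar or distant ones.  Each distance is normalized
    by its mean over the whole cloud."""
    return math.exp(-beta * d_angular / mean_angular
                    - gamma * d_euclidean / mean_euclidean)

# Distances equal to the cloud averages give exp(-beta - gamma).
print(edge_weight(0.2, 0.01, 0.2, 0.01))
# Identical points (zero distances) give the maximum weight of 1.
print(edge_weight(0.0, 0.0, 0.2, 0.01))
```

Normalizing by the mean makes the weight scale-free: doubling all coordinates leaves the weights unchanged.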
In my implementation I decided to voxelize the point cloud, as is done in SupervoxelSegmentation. I also reused the OctreePointCloudAdjacency class contributed by Jeremie Papon to establish the neighborhood relations between voxels. The weight is computed exactly as proposed by Lai et al. Finally, I use the Sparse module of Eigen to solve the linear system.
I mentioned before, but did not emphasize, that this segmentation approach is semi-automatic. This means that the user has to provide a set of labeled points. My current implementation either accepts user input or generates labeled points uniformly using the same approach as SupervoxelSegmentation does for seed generation. There are smarter ways of producing the initial labeling, and I plan to consider this issue later.
Here are some initial results that I got using the random walk segmentation algorithm. The big red dots show the locations of the labeled points:
Voxelized point cloud (voxel size 0.006 m)
Random walk segmentation (with one manually labeled point per object)
Random walk segmentation (with multiple manually labeled points in each object)
Random walk segmentation (with uniformly distributed labeled points)
In the first case (top right) I manually labeled each object in the scene with one point. Many of the object boundaries are correctly labeled; however, there are vast regions with totally random labels. This is especially evident in the upper right corner, where there is just one seed. A circular region around it is correctly labeled with a single color, but the points outside of it are all randomly colored. According to my understanding, this happens because the edge weights are always less than 1, so a random walker cannot get arbitrarily far from its starting point. Thus, if there are no labeled points in the vicinity of a point, all the computed probabilities are nearly zero and the label assignment happens randomly (because of numerical imprecision).
In the second case (bottom left) I manually labeled each object in the scene with multiple points guided by an intuition about how large the “zones of influence” are around each labeled point. The resulting segmentation is rather good.
In the third case (bottom right) the labeled points were selected uniformly from a voxel grid with size 10 cm. As a result I got an over-segmentation that resembles the ones produced by SupervoxelSegmentation.
I think the initial results are rather promising. In the future I plan to work on the weighting function as it seems to be the key component for the random walk segmentation. I would like to understand if and how it is possible to vary the size of the “zone of influence” around labeled points.
I concluded the previous blog post with a claim that the improved refinement procedure yields a better segmentation. That conclusion was based purely on visual inspection, and it served as a reminder that it is time to start thinking about quantitative evaluation metrics and to prepare a tool set to perform evaluations. In this blog post I will describe my first steps in this direction.
Let us start with a prerequisite for any evaluation, ground truth data. So far I have been using scenes from Object Segmentation Database (OSD). This dataset contains a ground truth annotation for each point cloud, however I am not completely satisfied with its quality. Consider the following point cloud (left) and the provided annotation (middle):
Color point cloud (left), provided annotation (middle), and refined annotation (right)
Some of the points in the close vicinity of the object boundaries are wrongly annotated. For example, almost all points on the left edge of the standing box are annotated as belonging to the table. I hypothesize that the annotation was obtained automatically using some color image segmentation algorithm. (Thus it is not really “wrong”; it is correct in terms of the color data.) As we are working with 3D point clouds, the spatial relations derived from the depth data should have a larger weight than color, and the annotation should be revised to respect the geometrical boundaries of the objects. Unfortunately, there is no tool in PCL that allows loading a point cloud and editing its labels. It cost me quite a few hours of work to put together a simple editor based on PCLVisualizer that can do that. An example of the refined annotation is shown in the figure above (right).
The authors of SupervoxelSegmentation used two metrics to evaluate how nicely the supervoxels adhere to the object boundaries: under-segmentation error and boundary recall. These metrics were introduced in:
I decided to begin with the under-segmentation error. Formally, given a ground truth segmentation into regions g_1, …, g_M and a supervoxel segmentation into supervoxels s_1, …, s_K, the under-segmentation error for region g_i is defined as:

E(g_i) = \frac{\left[ \sum_{s_j : s_j \cap g_i \neq \emptyset} |s_j| \right] - |g_i|}{|g_i|}
Simply put, it takes the union of all the supervoxels that overlap with a given ground truth segment and measures how much larger its total size is than the size of the segment. The authors of SupervoxelSegmentation use a slightly modified version which summarizes the overall error in a single number:

E = \frac{1}{N} \left[ \sum_{i=1}^{M} \left( \sum_{s_j : s_j \cap g_i \neq \emptyset} |s_j| \right) - N \right]

where N is the total number of voxels in the scene. In practice, this error sums up the sizes of all the supervoxels that cross the ground truth boundaries.
In my opinion, this definition biases the error with the average supervoxel size. Consider the following supervoxel segmentation (left):
Supervoxel segmentation (seed size 0.1 m); supervoxels that cross ground truth boundaries; erroneous voxels according to the modified definition
Clearly, the segmentation is rather good in terms of border adherence. The only failure is in the right corner of the lying box, where the green supervoxel “bleeds” onto the table. The middle image shows which voxels will be counted towards the error, i.e. those supervoxels that cross the ground truth boundaries. In fact, every supervoxel in the close vicinity of an edge is counted. The reason is simple: the edge between two surfaces in a typical point cloud obtained with an RGB-D camera is ambiguous. The ground truth segmentation is therefore random to some extent, and the chance that a supervoxel boundary will exactly coincide with the ground truth segment boundary is close to zero. Consequently, the error always includes all the supervoxels around the boundary, and the larger they are on average, the larger the error is. The “real” error (the “bled” green supervoxel) is completely hidden by it.
I decided to use a modified definition. Whenever a supervoxel crosses a ground truth boundary (or boundaries), it is split into two (or more) parts. I assume that the largest part is correctly segmented and only count the smaller parts as erroneous. The figure above (right) demonstrates which voxels would be counted towards the error in this case. There is still some “noise” around the boundaries, but it does not hide the “real” error. I also do not like that the error in the original definition is normalized by the total point cloud size. I think that the normalization should be related to the number of objects or the amount (length) of object boundaries. I will consider these options in the future, but for now I just count the number of erroneous voxels and do not divide it by anything.
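The two definitions can be contrasted on a toy example (a Python sketch over flat label arrays; the labels and sizes are made up). The original error charges the full size of every boundary-crossing supervoxel, while the modified one charges only the parts smaller than the largest:

```python
from collections import Counter, defaultdict

def undersegmentation_errors(gt, sv):
    """Given per-voxel ground truth labels `gt` and supervoxel labels `sv`,
    return (original, modified) unnormalized error counts."""
    overlap = defaultdict(Counter)  # supervoxel -> size of overlap per gt region
    for g, s in zip(gt, sv):
        overlap[s][g] += 1
    original = modified = 0
    for s, parts in overlap.items():
        if len(parts) > 1:  # this supervoxel crosses a ground truth boundary
            original += sum(parts.values())
            # assume the largest part is correct; count only the rest
            modified += sum(parts.values()) - max(parts.values())
    return original, modified

# 10 voxels, two ground truth regions ("t"able, "o"bject);
# supervoxel "b" crosses the boundary with two voxels on each side.
gt = ["t", "t", "t", "t", "t", "o", "o", "o", "o", "o"]
sv = ["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"]
print(undersegmentation_errors(gt, sv))
```

On this toy input the original error counts all four voxels of "b", while the modified one counts only the two voxels of its smaller half.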
I would like to conclude with the results of evaluating simple and improved supervoxel refinement procedures using the described error metric. The figure below shows the voxels considered as erroneous (in red) in supervoxel segmentations obtained with 2 rounds of simple refinement (left) and 2 rounds of improved refinement (right):
Supervoxel segmentation (seed size 0.1 m, with 2 rounds of simple refinement)
Supervoxel segmentation (seed size 0.1 m, with 2 rounds of improved refinement)
The under-segmentation error without refinement is 15033. Simple refinement reduces it to 11099 after the first iteration and 10755 after the second. Improved refinement reduces it to 9032 and 8373, respectively. The numbers agree with the intuition, so I will continue using this metric in the future.
Last time I mentioned that the simple refinement procedure does not always lead to good results. In this blog post I will discuss some of the problems and show an improved refinement procedure that addresses them.
Here is another scene with a cluttered table from Object Segmentation Database (OSD):
Color image of the scene
Fragment of the voxelized point cloud (voxel size 6 mm)
I will concentrate on the stack with two books and two boxes in the foreground. The figures below demonstrate the results of supervoxel segmentation without and with one round of simple refinement:
Supervoxel segmentation (seed size 0.1 m) without (left) and
with (right) refinement. Supervoxel centroids are overlaid.
Some supervoxels are enumerated on the left image.
There are many deficiencies in the segmentation output if no refinement is used. #3 is split into two large parts, and #8 has a small disjoint region. #2, #5, and #9 each cover two distinct surfaces. The simple refinement improves the segmentation in some respects (#1 and #4 became better, #8 no longer has a disjoint part); however, it fails to “join” #3 and, what's more, splits #5 and #7. Performing more rounds of the simple refinement does not help: e.g. after 5 iterations all three supervoxels remain disjoint.
Obviously, the simple refinement can only help if, after segmentation, the supervoxel centroid has moved to a new spot that is better than the original seed. Unfortunately, for supervoxel #3 the centroid ends up somewhere in between the two parts, therefore the seed does not move significantly and the same result is reproduced over and over again.
The simple refinement consists of creating new seeds from the centroids of supervoxels and running the segmentation algorithm again. I decided to improve it by splitting disjoint supervoxels prior to re-seeding. In each supervoxel I compute connected components (in terms of Euclidean distance). Each connected component that is not too small (has at least 15 points) defines a new seed voxel.
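The splitting step can be sketched as follows (plain Python instead of the PCL-based implementation; the minimum component size is lowered from 15 to 2 to suit the toy data):

```python
from collections import deque

def split_supervoxel(points, radius, min_size):
    """Split a supervoxel into Euclidean connected components (BFS over
    points within `radius` of each other).  Every component with at least
    `min_size` points yields a new seed: its centroid."""
    unvisited = set(range(len(points)))
    seeds = []
    while unvisited:
        comp = []
        queue = deque([unvisited.pop()])
        while queue:
            i = queue.popleft()
            comp.append(i)
            close = [j for j in unvisited
                     if sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                        <= radius ** 2]
            for j in close:
                unvisited.discard(j)
            queue.extend(close)
        if len(comp) >= min_size:  # small fragments do not become seeds
            seeds.append(tuple(sum(points[i][k] for i in comp) / len(comp)
                               for k in range(3)))
    return seeds

# Two clumps of voxels far apart yield two new seeds; a lone voxel is discarded.
pts = [(0, 0, 0), (0.01, 0, 0), (0, 0.01, 0),
       (1, 1, 0), (1.01, 1, 0), (1, 1.01, 0),
       (5, 5, 5)]
print(split_supervoxel(pts, radius=0.05, min_size=2))
```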
Below is an animated GIF where frames show supervoxels after each round of the improved refinement procedure. The number in the upper right corner gives round number (0 means before refinement):
5 rounds of supervoxel refinement with splitting
In the first iteration the yellow supervoxel is successfully split into two parts. As mentioned before, this refinement iteration degrades segmentation by making two other supervoxels disjoint. But this is fixed in the next iteration of refinement when they are also split. In the last two iterations the number of supervoxels does not change anymore, they only re-shape slightly.
The obtained supervoxel segmentation is far from perfect. Nevertheless, it was significantly improved by the updated refinement procedure.
In this blog post I will discuss one of the problems with supervoxel segmentation and a simple refinement procedure that resolves it.
Let us consider exactly the same part of the scene as in the previous blog post. For reference here are (again) the voxelized point cloud and object segmentation that I am getting:
Voxelized point cloud (voxel size 0.006 m); object segmentation (seed size 0.1 m) with overlaid edge graph
Below is the output of supervoxel segmentation. It is exactly the same as in the previous post, though the colors are different (because they are chosen randomly on each execution). Additionally, the seeds from which the supervoxels evolved, as well as their centroids, are visualized:
Supervoxel segmentation (seed size 0.1 m) with overlaid supervoxel seeds (red), supervoxel centroids (blue), and the adjacency graph
Last time I focused attention on the green and pink (formerly cyan and blue) supervoxels in the central part of the image. The problem I mentioned is that the pink supervoxel splits the green one into two disjoint parts. Another problem I did not talk about is that both supervoxels extend over the edge between the book's spine and cover. This is undesirable because the spine and cover are two distinct surfaces with different geometrical properties.
According to my understanding, the cause of this undesired segmentation is unlucky initial seeds. Both problematic supervoxels evolved from seeds which are on (or almost on) the edge between the spine and the cover. Right from the beginning both supervoxels started to absorb voxels from both surfaces. Only towards the end did the pink supervoxel start to converge to the cover surface and the green one to the spine surface. But since the number of iterations is limited, the supervoxels never reached convergence. (Please see here how the supervoxels evolved.)
The described problems have a heavy impact on the final object segmentation. Firstly, the green supervoxel is considered adjacent to the dark green supervoxel in the center because of the tiny disjoint part we discussed last time. The orientations of these two supervoxels are similar (both are vertical surfaces), therefore an edge is established and they are merged into the same object. (Unfortunately it is impossible to see in the image, but there is actually no edge from the dark green supervoxel to the pink one, but rather an edge from it to the green one, and then from the green one to the pink supervoxel.) Secondly, there is an edge between the pink supervoxel and the brown one below it, because they are adjacent and have similar orientations. Had the supervoxel segmentation been better, they would not touch each other and would not be connected. This explains why in the final object segmentation the table surface is merged with the book and the box on top of it.
I see two ways to address this problem. The first is to come up with a smart seeding algorithm which ensures that the seeds do not land on the edges between surfaces, while at the same time guaranteeing an even distribution of seeds over space. The second is to introduce a refinement step which post-processes the supervoxels output by the segmentation algorithm.
In fact, there already exists a function SupervoxelSegmentation::refineSupervoxels(). It has to be invoked explicitly after the initial segmentation is obtained. The function simply uses the centroids of supervoxels as seeds and runs the segmentation algorithm once again. So, in a sense, it does implement the “smart seeding” approach. The improvement is massive. Below are the results of supervoxel and object segmentation when a single round of refinement is used:
Supervoxel segmentation (seed size 0.1 m, with refinement)
Object segmentation (seed size 0.1 m, with refinement)
The final result is very good: no object is merged with any other one or with the table. Unfortunately, I do not get equally good segmentations of other cluttered scenes. In the next blog post I will demonstrate them and discuss how this simple refinement step could be improved.
As I mentioned in the previous blog post, I started to zoom in on the areas that cause problems for the segmentation/clustering algorithms. This made me explore how supervoxels actually grow, and in this short post I would like to share the gained insights.
Here is a part of the scene I used in the previous blog post observed from a slightly different viewpoint. On the left is the voxelized point cloud and on the right is the result of supervoxel segmentation with the default seed size (click on the image to see full-resolution version):
Voxelized point cloud (voxel size 0.006 m); supervoxel segmentation (seed size 0.1 m)
One thing that caught my attention in this segmentation is the cyan supervoxel in the central part of the image. If you examine it carefully, you will notice that it consists of two disjoint parts: a big blob to the south-west of the center and a few voxels exactly in the center of the image. The distance between these two parts is quite large compared to the seed size, so I became curious how exactly this segmentation came about. Even though the tutorial and the paper provide a detailed explanation of the process, I decided to visualize it to gain a better understanding.
Below is an animated GIF where frames show supervoxels after each of the 28 iterations made by the algorithm to segment the input point cloud:
Supervoxel growth (seed size 0.1 m)
The wavefront of the blue supervoxel chases the wavefront of the cyan one. At some point it breaks it into two parts and eventually “kills” the left one. The right one survives, presumably because its voxels are on the vertical surface of the box, which is similar to the rest of the cyan cluster; therefore the blue supervoxel (which is mostly “horizontal”) cannot seize them.
Although this particular case does not seem to be much of a problem, I can imagine a situation where a supervoxel is split into two or more equally large chunks. The centroid and normal of such a “broken” supervoxel will make no geometrical sense. Therefore, reasoning about them in further processing steps will also make no sense and will lead to random results.
How to cure this problem is an open question. We could require that supervoxels stay connected during expansion, though this might be computationally expensive to enforce. Alternatively, we could add a post-processing step which splits disjoint supervoxels into several smaller ones. I put this issue on the to-do list and will come back to it later.
In the previous blog post I briefly mentioned SupervoxelSegmentation algorithm that has recently become available in PCL. Today I would like to discuss it in a greater detail.
In a nutshell, SupervoxelSegmentation generates an over-segmentation of a 3D point cloud into small spatially compact regions (supervoxels) in which all points possess similar local low-level features (such as color, normal orientation). The main properties of the algorithm are that the supervoxels are evenly distributed across 3D space and (in most cases) do not cross object boundaries. A detailed explanation of the algorithm could be found in this tutorial and in the original paper:
Let’s consider a scene with a cluttered table from Object Segmentation Database (OSD):
Color image; voxelized point cloud (voxel size 6 mm)
Here are the results of supervoxel segmentation with two different seed sizes (0.1 m, which is the default, and 0.03 m):
Supervoxel segmentation (seed size 0.1 m); the same with overlaid adjacency graph
Supervoxel segmentation (seed size 0.03 m); the same with overlaid adjacency graph
The over-segmented point clouds look like patchwork sheets. The smaller the seed size, the smaller the patches are. The white meshes represent the adjacency graph of the supervoxels. It is not immediately obvious which seed size would result in a better final object segmentation. On one hand, smaller patches are more likely to reflect true object boundaries. Furthermore, in order to segment small objects, the patch size should not be greater than the smallest object we wish to segment. On the other hand, smaller patches mean more edges and more time to process them. Depending on the asymptotic complexity of the merging algorithm, that could become a crucial consideration.
The next step after supervoxel segmentation is to merge supervoxels to obtain final object segmentation. To begin with, I decided to implement the algorithm proposed in the following paper:
The basic idea is to assign each edge in the adjacency graph a weight that expresses the difference (or dissimilarity) between the pair of supervoxels that it connects. The algorithm starts by sorting the edges in non-decreasing weight order and creates disjoint sets, where each set contains the supervoxels belonging to the same object. In the beginning each supervoxel gets its own set. Then the algorithm iterates over all edges and unites the sets to which the supervoxels connected by an edge belong, if a certain condition holds. The condition I use is that the weight of the edge should be small compared to the weights of the edges between the supervoxels that are already in the sets. This algorithm is actually a modification of Kruskal's algorithm for finding a minimum spanning tree in a connected weighted graph.
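The merging procedure can be sketched with a union-find structure (a Python sketch; for brevity a fixed weight threshold stands in for the real condition, which compares an edge against the edges already inside the sets):

```python
def merge_supervoxels(n, edges, threshold):
    """Kruskal-style merging: process edges in non-decreasing weight order
    and unite the two sets whenever the edge weight passes the condition
    (here simplified to a fixed threshold).  Returns the set label of every
    supervoxel and the list of edges that were kept."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    kept = []
    for i, j, w in sorted(edges, key=lambda e: e[2]):
        ri, rj = find(i), find(j)
        if ri != rj and w < threshold:
            parent[ri] = rj  # unite the two sets
            kept.append((i, j, w))
    return [find(i) for i in range(n)], kept

# Four supervoxels in a chain: 0-1 and 2-3 are similar, the 1-2 edge is expensive.
labels, kept = merge_supervoxels(4, [(0, 1, 0.1), (1, 2, 0.9), (2, 3, 0.2)], 0.5)
print(labels, kept)  # two objects: {0, 1} and {2, 3}
```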
Defining a good difference function for a pair of supervoxels is crucial for performance. It should consider all the available information about the supervoxels, including their size, area, border, orientation, color, and so on. At the moment I use a relatively simple function which depends only on the orientation of the supervoxels. More specifically, if two supervoxels have centroids c1 and c2 and average normals n1 and n2, then the difference is the angle between n1 and n2 when the supervoxels are relatively concave to each other, and zero otherwise.
In other words, if the supervoxels are relatively concave to each other, then the dissimilarity is proportional to the angle between their normals. Otherwise it is zero, i.e. relatively convex supervoxels are always similar.
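The angle-based dissimilarity can be sketched as follows. The concavity test used here (comparing the projections of the two normals onto the centroid-difference vector) is an assumption, since the post does not spell out the exact test; the angle term matches the description:

```python
import math

# Hedged sketch of the orientation-only dissimilarity: the angle between the
# average normals for concave pairs, zero for convex pairs.
def dissimilarity(c1, n1, c2, n2):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    d = [b - a for a, b in zip(c1, c2)]        # vector from centroid 1 to 2
    # relatively concave: the normals "lean towards" each other along d
    # (this specific test is an assumption, not taken from the post)
    concave = dot(n1, d) - dot(n2, d) > 0
    if not concave:
        return 0.0                              # convex pairs are similar
    cos_angle = max(-1.0, min(1.0, dot(n1, n2)))
    return math.acos(cos_angle)                 # angle between unit normals
```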
Here are the results of running the algorithm on the supervoxel segmentations that were demonstrated before. The figures on the right show the edges that the algorithm kept (used to unite sets) on top of the object clusters:
|Object segmentation (seed size 0.1 m)||The same with overlaid edge graph|
|Object segmentation (seed size 0.03 m)||The same with overlaid edge graph|
There are a number of problems in both cases. For the large supervoxel size case the book at the bottom is united with the box on top of it, and the second book is segmented into two parts and merged with the table and the box on top of it. Also the saucepan in the back is joined with the table surface. The segmentation with more fine-grained supervoxels exposes different problems. Here the tetra-pak is joined with the box it stands on, and the box nearby it is merged with the table surface. Moreover, there are several single-supervoxel “objects” that were not joined to any other cluster. Finally, both cases share two other problems: the table is split into many pieces, and the bowl on the left is split into two parts.
I think these initial results are rather good, especially considering the fact that a very simple dissimilarity measure is used. In the next blog post I plan to zoom in at the problematic areas and discuss how the dissimilarity function could be improved to solve the observed problems.
Hi! This is my first blog post in the scope of the “Segmentation/Clustering of Objects in Cluttered Environments” code sprint. Today I would like to discuss the problem of table-top object segmentation and the tools that PCL currently has to offer.
Consider a situation where a robot observes a table with several objects standing on top of it. Assuming that the table surface is flat and large relative to the objects’ size, and also that the objects do not touch each other (standing either side-by-side or on top of each other), the following standard simple pipeline allows one to break the scene into objects: detect the dominant plane (e.g. with RANSAC), remove the points belonging to it, and cluster the remaining points by Euclidean distance.
PCL offers components to solve each of the mentioned subtasks. After putting them together and tweaking the several parameters involved, one could get an object segmentation system that works reasonably well under the aforementioned assumptions. The problem, though, is that those assumptions are far too restrictive.
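The clustering stage of that pipeline (in PCL, EuclideanClusterExtraction) can be sketched in a language-agnostic way as a flood fill over radius neighborhoods. The brute-force neighbor search below stands in for the kd-tree PCL would use:

```python
from collections import deque

# Sketch of Euclidean clustering: grow a cluster from a seed point by
# repeatedly absorbing all unvisited points within `tolerance` of it.
def euclidean_clusters(points, tolerance, min_size=1):
    def close(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) <= tolerance ** 2

    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, queue = [seed], deque([seed])
        while queue:
            i = queue.popleft()
            for j in [j for j in unvisited if close(points[i], points[j])]:
                unvisited.discard(j)
                cluster.append(j)
                queue.append(j)
        if len(cluster) >= min_size:
            clusters.append(sorted(cluster))
    return clusters
```

Two points closer than the tolerance always end up in the same cluster, which is exactly why touching objects defeat this pipeline.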
A typical real-world table with objects on top of it will be more challenging. Just have a look at your desk. The main properties are: a) the number of objects is unknown and is potentially very large; b) the objects do not stand in isolation, but rather touch and occlude each other; c) there might be several supporting planes, for example shelves or large objects that support smaller objects.
Unfortunately, there is nothing in PCL that could bring you anywhere close to segmenting such a scene. The available components include:
The first two methods are supposed to break a scene into foreground and background. Therefore unless one knows the number and approximate locations of all objects, these are of no use.
CRFSegmentation, as far as I understand, represents unfinished work and also requires an initial guess of the cluster to which each point in the dataset belongs.
The next five modules implement region-growing segmentation. SeededHueSegmentation requires seeds and considers only Euclidean distance and difference in hue. RegionGrowing considers Euclidean distance and normal orientations, whereas RegionGrowingRGB works with differences in RGB color instead of normals. ConditionalEuclideanClustering may use any user-defined comparison function to judge whether two points belong to one segment. Finally, OrganizedConnectedComponentsSegmentation is similar to the previous one, but is crafted for organized (Kinect-style) point clouds. According to my experiments, none of these modules is able to come up with a reasonable segmentation of a real-world table-top scene.
SupervoxelClustering is the latest addition to PCL. As its name suggests, it can be used to split (over-segment) a point cloud into a set of supervoxels. The algorithm uses the geometrical structure, normals, and RGB values of points. The results I get are very nice and object boundaries are typically preserved (i.e. a single cluster does not span multiple objects), however further processing is required to merge clusters into objects.
To conclude, there are tools in PCL that one can use to segment trivial table-top scenes. However, there is no module that could segment a complex real-world table-top scene out-of-the-box. Therefore, design of such a module will be the main focus of this code sprint.
As mentioned in the roadmap, one of the goals is to implement a framework that captures vital statistics of selected descriptors and correspondence types. These vital statistics would then be analyzed by one or more objective function(s) to enable scene based optimizations.
The first milestone, a metrics framework for descriptor evaluation, is now complete, and its output is in line with the characteristics cited in Rublee et al.’s ICCV 2011 paper, among other publications.
Specifically, the framework computes the intended vital statistics, including 2-way and multi-descriptor matching and inlier rates. The filter banks include L2-distance, L2-ratio, and a uniqueness measure. A simulated ground truth is also implemented and is generated at runtime. The framework has been applied to local 3D descriptors (FPFH33, SHOT352, and SHOT1344) across a range of downsampling leaf sizes (0.01-0.07) and across a range of in-plane rotations (0-90 degrees). A sample of the results is illustrated in the bar graphs below, which reflect the various metrics, computed at a 30 degree simulated rotation and at 2 levels of downsampling: 0.01 for the top bar graph and 0.07 for the next one. In total, 1680 rates were generated for further analysis by the objective function(s). A link is included below to a sample extended output for other 3D descriptors. Next step: to extend the framework to support 2D descriptors.
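One of the filter banks mentioned, the L2-ratio filter, is essentially Lowe's ratio test: a match is kept only if its best distance is sufficiently smaller than the runner-up's. A sketch with plain float lists standing in for FPFH/SHOT descriptors (the 0.8 ratio is an illustrative default, not the framework's actual setting):

```python
# Sketch of an L2-ratio match filter (Lowe's ratio test).
def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def ratio_filtered_matches(source, target, ratio=0.8):
    # assumes at least two target descriptors so a runner-up exists
    matches = []
    for i, d in enumerate(source):
        dists = sorted((l2(d, t), j) for j, t in enumerate(target))
        best, second = dists[0], dists[1]
        if best[0] < ratio * second[0]:     # distinct enough from runner-up
            matches.append((i, best[1]))
    return matches
```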
The extended output for other 3D descriptors follows, [click to enlarge]:
The project has started.
Thomas Whelan and John McDonald, two of our collaborators from the National University of Ireland in Maynooth, have been feverishly working away on extensions and improvements to PCL’s implementation of KinectFusion - KinFu.
Two of the major limitations of KinectFusion have been overcome by the algorithm, which Tom calls Kintinuous:
Tom has created some amazing mesh reconstructions of apartments, corridors, lecture theaters and even outdoors (at night) - which can be produced real-time. Below is a video overview showing some of these reconstructions.
We are hoping to integrate Kintinuous into a full SLAM system (iSAM) with loop-closures to create fully consistent 3D meshes of entire buildings. More details including the output PCD files and a technical paper are available here.
As a final blog post for this Toyota Code Sprint, I am attaching the final report I have written for the sponsors.
The presentation slides associated with the report:
And the midterm report:
Just a quick note that the visual odometry library we have been using within our particle filter, called FoVis (Fast Visual Odometry), has been released. It supports both stereo cameras (including the Pt Grey Bumblebee) and depth cameras (MS Kinect, Asus Pro Live) and achieves 30Hz using a single CPU core. I haven’t used another VO system, but it’s highly recommended.
FoVis was developed by Nick Roy’s group here in MIT. More details can be found in this paper:
Visual Odometry and Mapping for Autonomous Flight Using an RGB-D Camera. Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy Int. Symposium on Robotics Research (ISRR), Flagstaff, Arizona, USA, Aug. 2011
So I’ve been doing some analysis to determine the overall benefit, within our application, of the improvements mentioned below. It should give a feel for the impact of the various improvements going forward.
High Level Timing
First, let’s look at the high-level computing usage. We ran several tests, looking at elapsed time for different numbers of particles, and compared (1) doing the likelihood evaluation on the GPU (using the shader) against (2) doing it on the CPU (as previously).
The data in this case was at 10Hz, so real-time performance amounts to the sum of each set of bars being less than 0.1 seconds. For low numbers of particles, the visual odometry library, FoVis, represents a significant amount of computation - more than 50 percent for 100 particles. However, as the number of particles increases, the balance of computation shifts to the likelihood function.
The visual odometry and likelihood function could be shifted to different threads and overlapped to avoid latency. We haven’t explored this yet.
Other components such as particle propagation and resampling are very small fractions of the overall computation and are practically negligible.
In particular, you can see that the likelihood evaluation time increases much more quickly for the CPU evaluation.
Low Level Timing of Likelihood
The major elements in the likelihood are the rendering of the model view and the scoring of the simulated depths - again using the results of these same runs.
Basically the cost of rendering is fixed - a direct function of the (building) model complexity and the number of particles. We insert the ENTIRE model into OpenGL for each iteration - so there will be a good saving to be had by determining the possibly visible set of polygons to optimize this. This requires clustering the particles and extra caching at launch time but could result in an order of magnitude improvement.
It’s very interesting to see that the cost of doing the scoring is so significant here. Hordur put a lot of work into adding OpenGL Shading Language (GLSL) support. The effect is that for large numbers of particles, e.g. 900, scoring takes only 5% of available time (5% of 0.1 seconds). Doing this on the CPU would take 30%.
For the simplified model, this would be very important as the scoring will remain as is, but the render time will fall a lot.
NOTE: I think that some of these runs are slightly unreliable. For portions of operation there was a characteristic chirp in computation. I think this was the GPU being clocked down or perhaps my OpenGL-based viewer soaking up the GPU. I’ll need to re-run to see.
I finally managed to put Misha’s Poisson reconstruction implementation in PCL. It now works properly and passes the unit tests, with the same behavior as the original application. Furthermore, we are adapting it to the PCL coding style.
A new pcl::surface base class has been introduced, due to some confusion communicated via the pcl-users mailing list. Now there is the CloudSurfaceProcessing class, which represents the base class for algorithms that take a point cloud as input and produce a new output cloud that has been modified towards a better surface representation. These types of algorithms include surface smoothing, hole filling, cloud upsampling, etc.
In this category, we have the MovingLeastSquares algorithm and its additional features mentioned in previous blog posts, and the all-new BilateralUpsampling algorithm, based on the bilateral filter (for more details about the workings of the algorithm, please see the previous post). Suat was kind enough to help me by modifying the OpenNIGrabber so that it outputs 1280x1024 PointXYZRGB clouds when the Kinect is set to high-resolution RGB; this means that every second row and column contains NaN depth values. And here is where BilateralUpsampling comes in, using the RGB info to fill in the missing depth data, as visualized in the following example image:
In this post I’m going to talk about some GPU optimizations which have increased the speed of the likelihood function evaluation. Hordur Johannsson was the one who did most of this OpenGL magic.
Evaluating Individual Measurement to Model Associations
This is the primary approach of this method. Essentially, by using the Z-buffer and assuming a generative ray-based cost function, we can evaluate the likelihood along the ray rather than the distance in Euclidean space. This was discussed previously; I mention it here only for context.
Computing Per-pixel Likelihood Functions on the GPU using a Shader
The likelihood function we are currently evaluating is a normalized Gaussian with an added noise floor:
Previously this was computed on the CPU by transferring the depth buffer back to the CPU from the GPU. We have instead implemented a GPU shader to compute this on the GPU.
Currently we look this function up in a pre-computed look-up table. Next we want to evaluate it directly, to explore optimizing the function’s shape as well as things like image decimation. (Currently we decimate the depth image to 20x15 pixels.)
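The likelihood described above (a normalized Gaussian on the depth error plus a constant noise floor, pre-computed into a look-up table) can be sketched as follows. The sigma, floor, table range, and bin count are illustrative assumptions, not our actual parameters:

```python
import math

# Sketch of the per-pixel likelihood: Gaussian on depth error + noise floor,
# tabulated once so per-pixel evaluation is a single array lookup.
def build_likelihood_lut(sigma=0.1, floor=0.05, max_err=2.0, bins=2048):
    lut = []
    for k in range(bins):
        err = max_err * k / (bins - 1)
        gauss = math.exp(-0.5 * (err / sigma) ** 2)   # peak 1.0 at zero error
        lut.append((1.0 - floor) * gauss + floor)     # never below the floor
    return lut

def pixel_log_likelihood(measured, simulated, lut, max_err=2.0):
    err = min(abs(measured - simulated), max_err)
    idx = int(err / max_err * (len(lut) - 1))
    return math.log(lut[idx])
```

The noise floor keeps a single wildly wrong pixel (e.g. an occlusion) from vetoing an otherwise good particle, since the log-likelihood is bounded below.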
Summing the Per-pixel Likelihood on the GPU
Having evaluated the log likelihood per pixel, the next step is to combine them into a single log likelihood per particle by summation: log L = sum over pixels i of log l_i.
This can be optimized by parallel summation of the pixel images: e.g. from 32x32 to 16x16 and so on down to 1x1 (the final particle likelihood). In addition there may be some accuracy improvement from summing similarly sized values and avoiding floating point rounding errors: e.g. (a+b) + (c+d) instead of ((a+b)+c)+d.
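The pairwise (tree) summation can be sketched in a few lines; this is the same reduction pattern a GPU would run, halving the array each pass:

```python
# Sketch of pairwise (tree) summation: sum similarly sized partial results,
# as a parallel GPU reduction would, instead of keeping a running total.
def pairwise_sum(values):
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:                 # pad odd-length levels with zero
            vals.append(0.0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]
```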
We’re currently working on this but the speed up is unlikely to be as substantial as the improvement in the previous section.
Single Transfer of Model to GPU
In addition to the above: previously we used a mixed polygon model. We have now switched to a model made up of only triangles. This allows us to buffer the model on the GPU and instead transmit the indices of the model triangles that should be rendered in a particular iteration (which is currently all the triangles).
Future work will look at determining the set of potentially visible polygons - perhaps using a voxel-based index grid. This is more complex as it requires either implicit or explicit clustering of the particle set.
In addition to the above, we are now using an off-screen buffer, which allows us to render virtual images without being limited by the machine’s screen resolution.
Putting it all together
We’ve benchmarked the improvements using a single-threaded application which carries out particle propagation (including visual odometry), rendering, likelihood scoring and resampling. The test log file was 94 seconds long at about 10Hz (about 950 frames in total). We are going to focus on the increased number of particles for real-time operation with all modules being fully computed.
The biggest improvement we found was using the triangle model. This resulted in about a 4 times increase in processing speed. Yikes!
Using the triangle model, we then tested with 100, 400 and 900 particles the effect of computing the likelihood on the GPU using a shader:
This results in a 20-40% improvement, although we would like to collect more sample points to verify it. The result is that we can now achieve real-time performance with 1000 particles. For this log file we didn’t observe significant improvement in accuracy beyond about 200 particles - so the next step is to start working with more aggressive motion and data with motion blur. There, being able to support extra particles will become important.
In my next post I’ll talk about how that computation is distributed across the various elements of the application.
Time for a new blog post. Lately, I have been working on image-based approaches for solving our smoothing and surface reconstruction problems. A straightforward but very effective method I have wanted to implement for a long time is the one in:
The idea behind it is to use the RGB image to enhance the depth image, via joint bilateral filtering, based on the following formula: D'_p = (1/k_p) * sum over q in N(p) of D_q * f(||p - q||) * g(||I_p - I_q||), where, in our case, D is the depth image, I is the RGB image, f and g are Gaussian kernels, and k_p is the normalizing factor.
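A minimal sketch of that filter, using grayscale guidance and nested lists for brevity (the sigma values and window radius are illustrative assumptions; a real implementation would operate on RGB guidance and registered images):

```python
import math

# Sketch of a joint bilateral filter: the spatial kernel f weighs pixel
# distance, the range kernel g weighs the *guidance* image difference, and
# the weighted average is taken over the depth image D.
def joint_bilateral(depth, guide, radius=2, sigma_s=1.0, sigma_r=10.0):
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy, qx = y + dy, x + dx
                    if not (0 <= qy < h and 0 <= qx < w):
                        continue
                    f = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                    diff = guide[y][x] - guide[qy][qx]
                    g = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                    num += f * g * depth[qy][qx]
                    den += f * g
            out[y][x] = num / den
    return out
```

Because g is driven by the guidance image, a sharp intensity edge suppresses averaging across it, so depth edges that coincide with color edges are preserved.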
The nice thing about it is the fact that we can use the 15Hz mode of the Kinect in order to produce high resolution (1280x1024 px) RGB images and normal (640x480 px) depth images. By using this method, we can obtain 1280x960 depth images. I faced some problems with the registration of depth to high-res RGB images, so the results below show just the case of 640x480 depth and color images.
As you can see, there are a lot of new points in the filtered point cloud (168152 vs 141511 in the input), no noisy points, and their positions are coherent with their neighbors.
Just as a teaser, an example of the high-resolution image-based upsampling (will come back with more results once we solve the registration problem mentioned above):
Also, in the meantime, I have spent a frustratingly large amount of time fixing the Poisson implementation we had in PCL. It seems that there was a memory error in the version that a Google Summer of Code participant introduced in PCL, so in the next few days we shall try to bring in the newer version.
In the last period, I have concentrated on coming up with new and better upsampling methods for the Moving Least Squares algorithm. Also, a lot of issues with the ones presented last time were solved.
Everything was committed to trunk (along with a complete interface and documentation) and should be included in the next PCL release. The upsampling methods are the following:
A quick timing analysis shows us that the running times are well within the 2-minute limit mentioned in the project requirements. It is important to note that the bulk of the time (~35 s) is spent on computing the MLS surface, and only about 1-3 s on the actual upsampling. Thus, we can conclude that the quality improvements of the upsampling are well worth the additional ~5% increase in execution time.
|Upsampling method||Time(s)||Resulting # points|
A more conclusive test would be to take a Kinect cloud of a wall at a distance where the noise is accentuated (~3m) and try to fit a plane in each of the resulting upsampled clouds. In order to make the experiment more realistic, we took the picture of the wall at an angle, such that the quantization effects would increase along the wall. The numeric results are the following:
|Upsampling method||Cloud # points||% inliers|
Unfortunately these numerical values do not represent the actual quality of the fit, because of the varying point density across the cloud under the different upsampling methods. The parts of the wall closer to the sensor had a larger density and precision in the original cloud, and as points get farther from the sensor, the sparsity and noise increase. In VOXEL_GRID_DILATION and RANDOM_UNIFORM_DENSITY, however, the density is constant across the cloud, meaning that the noisy part of the wall has the same number of points as the more precise part.
As such, in order to analyze the quality of the fit, we do a visual analysis of the inliers/outliers ratio, as shown in the following pictures:
Original cloud and its plane inliers
NONE cloud and its plane inliers
SAMPLE_LOCAL_PLANE cloud and its plane inliers
RANDOM_UNIFORM_DENSITY cloud and its plane inliers
VOXEL_GRID_DILATION and its plane inliers
The conclusion is that the VOXEL_GRID_DILATION method behaves the best, as it has the least holes out of all the options.
So this is a wrap-up for the MLS smoothing. Next, I shall be looking into image-based hole-filling and how this can be applied to our problem. This will involve some experiments using MatLab and adding some code into the PCL I/O framework.
An overview of the system we are developing is illustrated at the bottom of this post. Sensor output is illustrated in lemon color, although only RGB-D data is considered in this project. Modules we have been actively developing are colored blue. The module in orange is our Incremental Motion Estimation algorithm. This module currently utilizes a feature-based visual odometry algorithm. One alternative to this approach would be Steinbrucker et al. (ICCV 2011), which was developed specifically for RGB-D data.
We have also illustrated in pink three modules which would naturally augment the point cloud-based localization system by providing redundancy and recovery methods. We have been developing these modules independently of PCL. They are illustrated here to demonstrate the flexibility of an SMC-based approach to localization.
Some typical VO performance for motion along a corridor. The boxes are 5m in size. We used a Kinect and the Freenect driver for this.
I’m currently finalizing a particle filter which uses Eigen Isometry3d and Boost. My old version used GSL for random number generation, which I want to move away from.
With some very good advice from Zoltan and a lot of hacking, we now have 3 upsampling methods for the MLS algorithm.
No additional points are created here. The input point cloud is projected onto its own MLS surface. This is exactly what we previously tested and presented in a recent blog post.
For each point, sample its local plane by creating points inside a circle with fixed radius and fixed step size. Then, using the polynomial that was fitted, compute the normal at that position and add the displacement along the normal. To reject noisy points, we increased the threshold for the number of points we need in order to estimate the local polynomial fit. This guarantees that points with a ‘weak neighborhood’ (i.e., noise) do not appear in the output.
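The disc-sampling part of this method can be sketched as follows. The projection back onto the fitted polynomial is omitted for brevity, so this only shows how the tangent-plane samples are generated (the basis construction is a standard trick, not taken from the PCL source):

```python
import math

# Sketch of SAMPLE_LOCAL_PLANE-style sampling: for a point with a unit normal,
# generate samples on a disc of fixed radius/step in the tangent plane.
def sample_local_plane(point, normal, radius, step):
    nx, ny, nz = normal
    # build a unit vector u orthogonal to the normal (avoid degenerate axis)
    u = (0.0, -nz, ny) if abs(nx) < 0.9 else (-nz, 0.0, nx)
    lu = math.sqrt(sum(c * c for c in u))
    u = tuple(c / lu for c in u)
    # v = n x u completes an orthonormal basis of the tangent plane
    v = (ny * u[2] - nz * u[1], nz * u[0] - nx * u[2], nx * u[1] - ny * u[0])
    samples = []
    a = -radius
    while a <= radius + 1e-12:
        b = -radius
        while b <= radius + 1e-12:
            if a * a + b * b <= radius * radius:   # keep the disc, not the square
                samples.append(tuple(p + a * ui + b * vi
                                     for p, ui, vi in zip(point, u, v)))
            b += step
        a += step
    return samples
```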
And a few results:
The first picture is the reconstruction of coke bottles on a table. Please notice the correction for the quantization effects. The table surface is now planar and the objects look ‘grippable’. The second picture is a similar scenario, now using tupperware. The last picture shows how well the door handle is reconstructed. We conclude that visually, the results are very good.
An immediate problem is that this method adds the same number of new samples to all points, not taking into account the local point density. An improvement we can make on this approach is to filter it with a voxel grid in order to obtain a uniform point density. Of course, this operation is optional, and is useful just for saving memory (see the picture below for a comparison between upsampled and upsampled + voxel grid).
Take as input a desired point density within a neighborhood with a fixed radius. For each point, based on the density of its vicinity, add more points on the local plane using a random number generator with a uniform distribution. We then apply the same procedure as in method 2 to project the points onto the MLS surface.
The results are satisfying. Compared to method 2, we do not need to apply the expensive voxel grid filter anymore. An issue might be the fact that, because we generate the points using a random number generator, the output point cloud looks a bit messy (compared to method 2, where the points are generated on a grid determined by the step size), but the surface is still well preserved. Also, the time performance is poor because of the random number generation.
This method makes sense theoretically, but in practice we are having serious issues optimizing it to fit into main memory. The idea behind it is to take each point pair within a fixed radius neighborhood and to sample the line connecting these two points. Ideally, this would fill up any small holes inside the cloud. The downside is that it also creates a lot of additional points in already dense areas. Handling this elegantly is something we need to think about.
In this blog post, we shall inspect the mesh construction methods available in PCL.
The algorithm was first presented 25 years ago in:
- William E. Lorensen, Harvey E. Cline: Marching Cubes: A high resolution 3D surface construction algorithm. In: Computer Graphics, Vol. 21, Nr. 4, July 1987
In PCL, these are implemented in the MarchingCubes class with the variants MarchingCubesGreedy and MarchingCubesGreedyDot. The ‘greedy’ comes from the way the voxelization is done. Starting from a point cloud, we create a voxel grid in which we mark voxels as occupied if a point is close enough to the center. Obviously, this allows us to create meshes with a variable number of vertices (i.e., subsample or upsample the input cloud). We are interested in the performance of this algorithm with the noisy Kinect data. Time-wise, the algorithm ran in about 2-3 seconds for a 640x480 cloud.
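The 'greedy' occupancy step described above can be sketched as follows: a voxel is marked occupied when some input point lies close enough to its center. The brute-force check over neighboring cells stands in for the kd-tree search PCL would use, and the parameters are illustrative:

```python
import math

# Sketch of greedy voxelization: mark a voxel occupied if any point is within
# max_dist of its center. Voxels are keyed by their integer grid coordinates.
def greedy_voxel_occupancy(points, leaf, max_dist):
    occupied = set()
    for px, py, pz in points:
        # only the cell containing the point and its immediate neighbors can
        # have a center within max_dist (assuming max_dist < leaf)
        cx, cy, cz = (int(math.floor(c / leaf)) for c in (px, py, pz))
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    v = (cx + dx, cy + dy, cz + dz)
                    center = tuple((i + 0.5) * leaf for i in v)
                    d2 = ((px - center[0]) ** 2 + (py - center[1]) ** 2
                          + (pz - center[2]) ** 2)
                    if d2 <= max_dist ** 2:
                        occupied.add(v)
    return occupied
```

Shrinking the leaf size multiplies the number of occupied voxels, which is why the output vertex count can be varied independently of the input cloud.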
The following figure shows the results for various leaf sizes (from left to right, bottom to top: leaf size of 0.5 cm, 1 cm, 3 cm, and 6 cm, respectively):
And a close-up on the highest resolution cloud:
We conclude that the results are not satisfactory, as the upsampling is ‘artificial’ and does not inherit the properties of the underlying surface. Furthermore, there is no noise-removal mechanism and the blocking artifacts are disturbing.
Naive Algorithm for Organized Point Cloud Triangulation
This algorithm is implemented in the OrganizedFastMesh class in PCL. The idea behind it is very simple: it takes each point in the inherent 2D grid of the Kinect clouds and triangulates it with its immediate neighbors in the grid. One can quickly see that NaN points (points that were not captured by the sensor) will result in holes in the mesh. This is a mesh construction method and will output a mesh with exactly the same vertices as the input cloud. It does not take care of noise or NaN values in any way.
A screenshot of the output can be seen in the following figure. Visually, the result is decent, considering that the processing time is extremely small - just a single pass through all the points of the clouds.
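The triangulation itself reduces to a single pass over the 2x2 cells of the sensor grid; the sketch below uses row-major vertex indices as in an organized PCL cloud, skipping any cell that touches a NaN (the triangle winding order is an assumption):

```python
import math

# Sketch of naive organized triangulation: two triangles per 2x2 grid cell,
# skipping cells with missing (NaN) depth measurements.
def organized_fast_mesh(depth, width, height):
    valid = lambda i: not math.isnan(depth[i])
    triangles = []
    for y in range(height - 1):
        for x in range(width - 1):
            i00 = y * width + x          # top-left
            i01 = i00 + 1                # top-right
            i10 = i00 + width            # bottom-left
            i11 = i10 + 1                # bottom-right
            if all(valid(i) for i in (i00, i01, i10, i11)):
                triangles.append((i00, i10, i01))
                triangles.append((i01, i10, i11))
    return triangles
```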
We have managed to collect all the datasets required by Toyota. For a complete description, please visit the following link (also accessible from my main page).
Programming-wise, we have spent time fixing bugs and beautifying the pcl_surface module. After I will finish my exams next week, I shall start looking into implementing some new algorithms.
So as to quantify the benefit of using our depth image simulation mode for localization, we need to do some benchmarking. There are a number of parameters we need to test:
And some metrics:
Where error is measured using LIDAR-based ground truth.
In addition to this, we also developed a second type of likelihood function. This method is essentially equivalent to ICP, without the iterative part. Each pixel’s location is compared to the nearest point in Euclidean space.
In this figure we try to illustrate some examples of where the scores would be substantially different. So we have an RGB-D sensor sensing two depth samples (purple) something like the end of a corridor (green, from above):
For the depth image simulation method (blue, ray-to-plane) the likelihood is formed from the difference in depth (blue star). For the ICP-type method (red, point-to-plane) the distance can be very different. For the lower case the distances are approximately the same; for the upper case, the association is very different. Also, the ICP-type method requires searching over the entire set of planes for the explicit correspondence, which is quite expensive.
We wanted to test what effect this choice has on accuracy. Below are some figures showing the results from an extensive set of experiments. The figures were produced with my 2-year-old 4-core 2.53GHz Core2 machine with an Nvidia Quadro 1700M with 32 cores. Each result is the average of 20 independent Monte Carlo runs. Total testing is equivalent to 16 hours of runtime.
First of all, error broadly converges to about the same accuracy as the number of particles increases:
The 50cm figure is largely because we aim to maintain a multi-modal estimate, so as to be more robust when tracking. To get the optimal performance for each frame, you could run ICP using the most likely particle as the starting condition. We haven’t done that here.
This figure shows the timing performance as the number of particles increases:
Our implementation of the nearest-plane lookup was pretty slow. However, our target framerate of 10Hz was achieved with 100 particles for the depth image likelihood function (ray-to-plane). As you saw above, the error has typically converged by 100 particles, so that’s been sufficient for real-time operation with the type of motion you see in the figure below.
For all of these figures we decimated the original RGB-D image by a factor of 32. Next we would like to look at what effect that has - especially now that my new laptop has 256 GPU cores.
Additionally, we want to look at sub-dividing the full model, so that only the planes near the particle poses [within 20m] are passed to OpenGL for testing. This is likely to give us a substantial performance improvement. I believe that 1000 particles at 10Hz should be achievable.
Interestingly, the approach developed by Ryohei Ueda during his internship at PCL/Willow Garage is very close to what we do here: the tracking paradigm is essentially inverted. Later in this project it would be interesting to apply the depth image simulation method to his tracking algorithm and see if it can be sped up. Given that the object models he was using are much smaller than a building model, it should.
For the VTK smoothing tests, we took the raw clouds, triangulated them using the OrganizedFastMesh triangulation with the TRIANGLE_ADAPTIVE_CUT option, and then fed this to the 3 smoothing algorithms from the VTK library.
The first one to be tested is MeshSmoothingLaplacianVTK with the default parameters recommended by VTK, but with an increase on the number of iterations from 20 to 100.
Here, the results are satisfactory: in both cases, the quantization artifacts are reduced (though still visible).
Also, if we look at the corresponding mesh, the reconstruction after smoothing looks more natural, with a better surface curvature.
Bottles and Tupperware Datasets
In this case, the Laplacian smoothing does not work well anymore. The quantization and the high noise level are still present in both the bottles and tupperware datasets. The main reason for this is that the objects of interest were quite far away from the sensor and the quantization artifacts are quite accentuated (i.e., there are large gaps between the points belonging to the same object).
The mesh subdivision schemes provided by the VTK library are not of great use for our scenarios, as they just split up triangles in the mesh, inheriting its artifacts. Furthermore, these schemes are highly dependent on the quality of the initial triangulation, which in our case is the simple OrganizedFastMesh and does not yield excellent results. They basically just resample points on the triangles present in the input mesh, without taking into consideration any more complex information about the vertex neighborhood.
Another thing we tried was to combine the simple subdivision with the previous Laplacian smoothing; the results are visually decent, as shown in the next figure. Again, we inherit the problems of the subdivision scheme (the holes caused by the incorrect triangulation).
In the meantime, I have worked on solving some issues with Zoltan Marton’s Greedy Projection Triangulation. Two Trac issues regarding this were solved, but its current state does not allow us to reconstruct Kinect scans; once we solve this, I will benchmark the gp3 algorithm too. Other time-consuming fixes were done for OrganizedFastMesh.
A direction we definitely need to look into is algorithms that also add points during reconstruction. The original MLS and GP3 papers do mention this possibility, but it has not been implemented in PCL yet. It is clear that we still do not have the Holy Grail of smoothing.
After collecting part of our datasets of interest (there are still some objects missing from our collection; we will get them next week), we proceeded to test our available smoothing algorithms. Please note that these tests use only real sensor data of scanned objects that are rather irregular, so we do not have any ground truth for our benchmarks. As such, we will limit ourselves to a visual inspection of the results. This inspection will look mostly for sensor artifacts that might remain in the clouds after the algorithms were applied (please see the problem description page for more details) or artifacts caused by the algorithm itself (issues such as over-smoothing).
One of the best algorithms we currently have in the PCL library is the MovingLeastSquares implementation. We ran this algorithm on the bed_sheets dataset and tweaked the parameters to see the effects they produce.
The first image, from left to right:
The results seem satisfactory in general. MLS removes some of the quantization effects (note that the bed was about 1.5-2 m away from the camera), although the slices are still clearly visible. Because the details in some wrinkles were lost with a 5 cm smoothing radius, we also tried a 3 cm radius, which reduced the over-smoothing effect.
The second image, left to right:
Here, we show that polynomial fitting in the MLS algorithm is useful for preserving sharp edges. One can see that the image in the middle is over-smoothed with the 5 cm radius, but the ridges are preserved in the third image.
We applied MLS to the tupperware dataset and obtained the following results.
Both images, from left to right:
On one hand, MovingLeastSquares seems to group points together and form visible ‘long holes’. This is due to the heavy quantization errors introduced by the sensor: the table and the curtains in the back are at about 2.5-4 m from the camera.
On the other hand, it clearly improves the shape of the objects. The second figure shows a top-down view of the table. The tupperware seems much smoother and more grippable, without loss of information.
In the list of objects we are interested in, there are transparent glasses/mugs. Unfortunately, the PrimeSense technology proves incapable of recording ANY depth for the points corresponding to the glasses, as shown in the following image. There is nothing a surface reconstruction algorithm can do in order to recreate the points on the glasses, so we shall discard this dataset in our following benchmarks.
As expected, the transparent parts of the plastic bottles have not been recorded by the depth sensor.
The image below, from left to right:
The result is very satisfactory. MLS does NOT add any points in the reconstruction, but one can notice the very good silhouette of the bottles compared to the very noisy input.
As required by Toyota, we started recording a series of typical household scenes. This first post shows the first 23 recordings we made using an Asus Xtion Pro camera. One can easily download them with the following command:
svn co http://svn.pointclouds.org/data/Toyota
These datasets are mainly meant to represent realistic situations that a personal robot might face in an undirected human environment. All of the scenes are recorded starting from a distance of about 3-4 meters from the main subject, then moving closer and rotating around it, in order to simulate the behavior of a robot and to capture most of the artifacts that the PrimeSense cameras present.
These are split into the following categories:
Bed Sheets - 3 styles of bed sheets in bedrooms:
Bottles - 2 layouts on a table in the kitchen
Door Handles - 5 styles of indoor/outdoor door handles
Glasses - one recording for opaque mugs and one for transparent glasses in the kitchen
Keyboards - 4 different laptop keyboards on an office desk
Shoes - 2 recordings
Tupperware - 3 recordings of tupperware on the kitchen table
Other - 2 other recordings I found interesting for the point cloud smoothing problem
With the help of Michael and Radu, we have made a few changes to the pcl::surface module. We have now structured it by adding three base classes which differentiate between algorithms with distinct purposes:
Please note the new classes ending in VTK. We already had these implemented in PCL before, but in quite a simple state. They are now fully usable and documented.
The recording of the required datasets is in progress, and they will be tested with most of the algorithms mentioned above.
Also, a new Poisson implementation is underway.
I have not been too active lately due to intense school activities (exams and end of semester projects/presentations). I am now ready to continue with my TOCS assignments.
A couple of weeks ago, some discussions took place between Toyota and PCL representatives, and my project became a bit clearer. I am going to spend the following days creating a database of recordings of different household items and household-specific scenes. Next, I shall apply all the current algorithms we have in PCL for surface smoothing and reconstruction and report back with the results of a qualitative analysis of the output.
After some prodding from Christian at Willow, we fixed a few bugs with our coordinate frames. Thanks! Applying the camera transform to the point clouds now results in perfect registration of two views. Here are three views of a teapot without noise or quantization:
And with noise and quantization:
Basically, the bugs came about because we inverted the Z axis when reading the depth buffer, making our system left-handed. (Stop hating on us leftys!) These are the coordinate frames we’re using now.
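A quick sanity check for this class of bug (an illustrative snippet, not our actual code): the basis matrix of a proper right-handed frame has determinant +1, and negating one axis, as our Z flip effectively did, turns it into -1, i.e. a left-handed frame:

```cpp
#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Determinant of a 3x3 basis matrix: +1 for a right-handed orthonormal
// frame, -1 once one axis (here Z) has been negated.
double det3(const Mat3& m)
{
  return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
       - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
       + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// Negate the Z row of a basis matrix, flipping the handedness.
Mat3 flipZ(Mat3 m)
{
  for (int i = 0; i < 3; ++i)
    m[2][i] = -m[2][i];
  return m;
}
```

Asserting det3 == +1 on the camera frame at load time would have caught the inverted depth-buffer axis immediately.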
We’re just about ready to add shader-based cost functions.
As mentioned in the roadmap, one of the steps is to implement a method to help find edges in depth images. The one I started looking into was proposed by John Novatnack and Ko Nishino in “Scale-Dependent 3D Geometric Features”. The final goal of the paper is to obtain scale-dependent 3D feature descriptors, but along the way they compute edges and corners in 3D.
The novelty of the approach is that they compute the scale-space of a dense and regular 2D representation of the surface using the normals of the scan. Technically, they create the Gaussian pyramid of the “normal images” and also the first and second derivative (i.e. Laplacian) of the levels of this pyramid.
Just like in 2D computer vision, the edges are found by looking for the zero-crossings of the Laplacian of the normal maps at different scales (+ some thresholding on the corresponding first derivative).
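To make the zero-crossing idea concrete, here is a toy 1-D sketch. The paper applies this per pyramid level on 2-D normal maps (with an additional threshold on the first derivative); this version just finds where the discrete Laplacian of a scalar signal changes sign:

```cpp
#include <cstddef>
#include <vector>

// Compute the discrete Laplacian (second difference) of a 1-D signal and
// return the indices where it crosses zero, i.e. where the signal's
// curvature changes sign -- the 1-D analogue of an edge response.
std::vector<std::size_t> zeroCrossings(const std::vector<double>& s)
{
  std::vector<double> lap(s.size(), 0.0);
  for (std::size_t i = 1; i + 1 < s.size(); ++i)
    lap[i] = s[i - 1] - 2.0 * s[i] + s[i + 1];

  std::vector<std::size_t> idx;
  for (std::size_t i = 1; i + 2 < s.size(); ++i)
    if (lap[i] * lap[i + 1] < 0.0)  // sign change between i and i+1
      idx.push_back(i);
  return idx;
}
```

In the actual method the same sign-change test runs at each scale of the pyramid, which is what makes the detected edges scale-dependent.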
An example of the pyramid of normal maps:
Edges found in an example scene:
More results will be posted once this is finished.
We are getting pretty close to having a complete RGB-D simulator integrated into PCL. Below you can see some figures showing:
Note the disparity-based quantization and the Gaussian noise. A fully realistic simulator will be much more complicated though! The images correspond to a 3D model of our Stata Center building which we have, at about this location:
In addition here is the RangeImage using Bastian Steder’s work - which we’ve integrated:
For some reason it appears very small, hence the low resolution (to be fixed). We haven’t done much else, but feature extraction, segmentation, and registration should be possible, and it could be useful for unit testing and stochastic evaluation.
We have a GLUT-based application that takes as input a single .ply file; the user can use the mouse to ‘drive’ around and take shots. For the really interested, you can try out our range-test program in pcl/simulation. Perhaps it’s useful to people who have had problems using OpenNI. We are currently rewriting it using VTK.
Here’s the sample ply file to use from MIT’s Stata Center:
We’ve been talking with Alex about combining efforts towards a library for point cloud simulation. Hordur Johannsson has been looking at using OpenGL shaders to compare these types of simulated views with real data, to give a measure of the match between two images.
NOTE: the models were generated by students working with Professor Seth Teller. More details and models here:
In the past days, I have been researching the literature on point cloud upsampling and smoothing, trying to find approaches that might be suitable for the PrimeSense cameras. I will produce a blog post about this as soon as I have done some conclusive experiments.
Until then, the following figure shows the current status of the Virtual Scanner application. It now has a GUI written in VTK, where the user can load VTK-compatible objects, freely manipulate a camera and produce 3D scans of the scene. The scanned cloud is shown live in another window.
I have tried to make the artifacts of the output cloud similar to the ones produced by the Kinect. The solution for the quantization artifacts was suggested by Suat and it consists of the following:
There are still some interface issues to be solved, and this will be committed to trunk soon.
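As a rough sketch of how this kind of disparity-based quantization is usually modeled (the constants below are illustrative placeholders, not the exact values used in our application): convert depth to disparity, round the disparity to the sensor's resolution, and convert back. The resulting depth steps grow quadratically with distance, just like the slicing artifacts we see in real Kinect scans:

```cpp
#include <cmath>

// Simulate disparity quantization: d = f*b/z, round d to the disparity
// step (PrimeSense devices use 1/8-pixel units), then re-project to depth.
// focal_px, baseline_m, and step are illustrative defaults, not calibrated.
double quantizeDepth(double z, double focal_px = 575.0,
                     double baseline_m = 0.075, double step = 0.125)
{
  double d = focal_px * baseline_m / z;     // ideal disparity (pixels)
  double dq = std::round(d / step) * step;  // snap to disparity resolution
  return focal_px * baseline_m / dq;        // quantized depth
}
```

Note how nearby depths collapse onto the same quantization level, producing the characteristic slices in the simulated cloud.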
I started off by doing some experiments with one of the smoothing algorithms we already have implemented in the PCL library: Moving Least Squares smoothing.
First, the influence of the search radius on the smoothing output was analyzed (without fitting a polynomial). The following figure shows the results: from bottom to top, left to right: original kinect cloud, MLS with search radii: 0.01, 0.02, 0.05; color coding by curvature.
The best result, which smooths the wall to a plane while keeping the shapes of the objects, is obtained with a search_radius of 0.02 (= 2 cm). 0.01 does not perfectly smooth the wall, and 0.05 eliminates the depth of the small figure on the desk.
We also looked into time performance; the collected data is presented in the following table. Two different search approaches were used: the kd-tree implementation from FLANN and the search::OrganizedNeighbor class using a window-based approach (an approximate method).
| MLS search_radius | KdTreeFLANN | OrganizedNeighbor | Time improvement |
|-------------------|-------------|-------------------|------------------|
| 0.01              | 5.6 s       | 3 s               | 46 %             |
| 0.02              | 19.1 s      | 10.8 s            | 43 %             |
| 0.05              | 107 s       | 61.8 s            | 42 %             |
| 0.07              | 201.3 s     | 117.3 s           | 42 %             |
So, the immediate conclusion is that MLS is definitely not suited for real-time applications. It would, however, be a viable option as a post-processing step for the registration pipeline we mentioned in the roadmap.
Next, we varied the order of the polynomial to be fitted. The following figure shows the results: MLS with polynomial fitting of orders 0, 1, 2, and 3 with a constant search radius of 0.02 (ordered left to right, bottom to top).
The differences in the results are rather subtle; some fine details tend to be preserved with higher-order polynomial fitting. But these fine details are mostly due to noise, and the extra time one has to pay for the polynomial fitting is not worth the small improvements, as the following table shows:
| MLS polynomial order | Time   | Increase over order 0 |
|----------------------|--------|-----------------------|
| 0                    | 18.4 s | 0 %                   |
| 1                    | 19.4 s | 5 %                   |
| 2                    | 22.5 s | 22 %                  |
| 3                    | 25.8 s | 40 %                  |
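To illustrate what the polynomial-fitting step involves, here is a toy 1-D weighted least-squares fit. PCL's actual MLS fits a bivariate polynomial over heights above a locally fitted plane; this sketch only shows the Gaussian-weighted normal-equations solve, whose size (and cost) grows with the polynomial order:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fit a degree-`order` polynomial in (x - query) to the samples (x, y),
// weighting each sample by a Gaussian of its distance to the query point.
// Solves the (order+1)x(order+1) normal equations with plain Gaussian
// elimination (no pivoting; fine for this small demo).
std::vector<double> wlsPolyFit(const std::vector<double>& x,
                               const std::vector<double>& y,
                               double query, double radius, int order)
{
  const int n = order + 1;
  std::vector<std::vector<double>> A(n, std::vector<double>(n + 1, 0.0));
  for (std::size_t k = 0; k < x.size(); ++k)
  {
    double d = x[k] - query;
    double w = std::exp(-d * d / (radius * radius));  // Gaussian weight
    for (int i = 0; i < n; ++i)
    {
      for (int j = 0; j < n; ++j)
        A[i][j] += w * std::pow(d, i + j);
      A[i][n] += w * std::pow(d, i) * y[k];
    }
  }
  for (int i = 0; i < n; ++i)          // forward elimination
    for (int r = i + 1; r < n; ++r)
    {
      double f = A[r][i] / A[i][i];
      for (int c = i; c <= n; ++c)
        A[r][c] -= f * A[i][c];
    }
  std::vector<double> coeff(n);        // back substitution
  for (int i = n - 1; i >= 0; --i)
  {
    double s = A[i][n];
    for (int j = i + 1; j < n; ++j)
      s -= A[i][j] * coeff[j];
    coeff[i] = s / A[i][i];
  }
  return coeff;  // the smoothed value at the query point is coeff[0]
}
```

Since this solve runs once per point in the cloud, the growth of the normal-equations system with the order explains the timing increases in the table above.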
The first major step is to port over our work for generating simulated range images from a global world view. It uses OpenGL to render simulated image views in much the same way as it would for a gaming application. The implementation has a few extra bells and whistles which mean that arrays of (smaller) views can be read efficiently. This allowed us to render hundreds of simulated views at tens of frames per second.
When this is fully working, there is a test program which will allow the user to “drive” around a simulated world and generate a log of RGB-D data.
The maps we were using previously used a pretty funky file format: basically each plane was read from an individual PCD file, so literally thousands of files were read in to build the map. The next step in our work is to enable support for the obj, vtk, and ply file types. Thankfully, PCL already has good support for reading these files.
Our contribution to the Toyota Code Sprint will be focused on Global Point Cloud Localization.
We’ve been working on this problem previously using RGB-D/Kinect sensors. You can see an overview of our work over on the main part of pointclouds.org by clicking here
This video provides an overview of the method:
The problem is summarised as localizing the sensor/robot/vehicle within a prior map of points or higher-level objects (planes, curves, or even building models). It’s common to use probabilistic methods such as Sequential Monte Carlo to propagate a belief over a certain state.
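As a minimal illustration of the Sequential Monte Carlo idea (a 1-D toy, not our localization code): particles each hold a pose hypothesis, are propagated with motion noise, reweighted by a measurement likelihood, and resampled in proportion to their weights:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Minimal 1-D particle filter: predict with noisy motion, weight by a
// Gaussian measurement likelihood, resample, estimate by weighted mean.
struct ParticleFilter
{
  std::vector<double> poses, weights;
  std::mt19937 rng{42};

  explicit ParticleFilter(int n) : poses(n, 0.0), weights(n, 1.0 / n) {}

  void predict(double motion, double sigma)
  {
    std::normal_distribution<double> noise(0.0, sigma);
    for (double& p : poses)
      p += motion + noise(rng);  // propagate each hypothesis with noise
  }

  void update(double measured, double sigma)
  {
    double sum = 0.0;
    for (std::size_t i = 0; i < poses.size(); ++i)
    {
      double d = poses[i] - measured;
      weights[i] *= std::exp(-d * d / (2.0 * sigma * sigma));
      sum += weights[i];
    }
    for (double& w : weights)
      w /= sum;  // normalize so weights form a distribution
  }

  void resample()
  {
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    std::vector<double> next(poses.size());
    for (double& p : next)
      p = poses[pick(rng)];  // draw particles proportionally to weight
    poses.swap(next);
    std::fill(weights.begin(), weights.end(), 1.0 / weights.size());
  }

  double estimate() const
  {
    double e = 0.0;
    for (std::size_t i = 0; i < poses.size(); ++i)
      e += weights[i] * poses[i];
    return e;
  }
};
```

In the real problem the pose is 6-DOF and the likelihood comes from comparing a simulated range view against the live scan, but the predict/update/resample loop is the same.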
What prior work is there on this domain? Well for one, the Google Autonomous car is localized in this way!
Related to this, we hope to extend PCL support for simulated sensors, as well as exploring the optimization and improvement of these probabilistic methods. In terms of simulation, the addition of various noise models is also important, given the various wacky effects that we see with the Kinect/Xtion/PrimeSense devices.