The Honda Research Institute code sprint has finished. All code has been committed, and the final report is attached below.
We have implemented almost everything that was planned, and now we want to present our results.
First of all, please watch this video with results on the whole Castro dataset. The road is marked in red; the left and right road borders are indicated by green and blue lines, respectively.
The main problem is the presence of holes in the road surface, caused by holes in the input disparity maps. We decided not to inpaint them, because we have no information about the scene at those points. Instead, we compute the quality of the labeling only at points with known disparity. This allows us to evaluate our method independently of the quality of the method used to generate the disparity maps.
Our method has a significant advantage in situations where one or both curbs are clearly visible (with the corresponding sidewalks). You can compare the result of the previous sprint’s method (left) with the result of our method (right) on the same frame, which has two clearly visible sidewalks.
Next, I’m going to show you the numerical results. Precision is the ratio of correctly detected road points to all detected pixels; recall is the percentage of road points that were detected. Only points with known disparity are taken into account.
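To make the metric unambiguous, here is a minimal sketch of how precision and recall can be computed over flattened label masks, counting only pixels with known disparity (the mask layout is an illustrative assumption, not our actual evaluation code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Precision/recall restricted to pixels with valid disparity.
// All three masks are flattened images of equal size.
std::pair<double, double> precisionRecall(const std::vector<bool>& predictedRoad,
                                          const std::vector<bool>& groundTruthRoad,
                                          const std::vector<bool>& known) {
    std::size_t tp = 0, fp = 0, fn = 0;
    for (std::size_t i = 0; i < known.size(); ++i) {
        if (!known[i]) continue;  // skip pixels without disparity
        if (predictedRoad[i] && groundTruthRoad[i]) ++tp;
        else if (predictedRoad[i] && !groundTruthRoad[i]) ++fp;
        else if (!predictedRoad[i] && groundTruthRoad[i]) ++fn;
    }
    double precision = (tp + fp) ? static_cast<double>(tp) / (tp + fp) : 0.0;
    double recall = (tp + fn) ? static_cast<double>(tp) / (tp + fn) : 0.0;
    return {precision, recall};
}
```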
We have implemented an algorithm that processes frames independently (i.e., without using information from the previous frame). We also currently assume that both curbs (left and right) are present in the scene.
Below you can see a projection of the labeled DEM onto the left image. Green points correspond to the left sidewalk, blue to the right one, and red points mark the road surface. The algorithm could not find the right curb in this image, so the right side of the road was labeled incorrectly. The good news is that the left curb was detected correctly.
However, our goal is to label the road on the image, not on the DEM. So, if we mark each pixel with the label of the corresponding DEM cell, we get the following labeling of the road surface:
You can see a lot of holes in the road area. They are caused by holes in the disparity map. We decided not to fill them, because someone or something could be located there and we have no information about it.
A disparity map of this frame is shown below. Points without disparity are marked in red.
Hello everybody. For the last few weeks I have been trying to train an SVM for car recognition. For this purpose I used point clouds of the city of Enschede, Netherlands, that I had manually labeled earlier. The training set consists of 401 clouds of cars and 401 clouds of other objects (people, trees, signs, etc.). As the classifier, I used the Support Vector Machine from the libSVM library.
During training I used 5-fold cross-validation and a grid search to find the best values of gamma and the soft margin C (parameters of the Gaussian kernel). The best accuracy achieved during cross-validation was 91.2718%.
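The grid search can be sketched as follows. The exponential grid is the conventional libSVM-style one, and `crossValidate` is a stand-in for the 5-fold cross-validation call; it is an assumption for illustration, not libSVM’s actual API:

```cpp
#include <cmath>
#include <functional>

struct GridResult { double bestGamma, bestC, bestAccuracy; };

// Exhaustive grid search over (gamma, C); `crossValidate` returns the
// cross-validation accuracy for one parameter pair.
GridResult gridSearch(const std::function<double(double, double)>& crossValidate) {
    GridResult best{0.0, 0.0, -1.0};
    // Conventional exponential grid: gamma = 2^-15 .. 2^3, C = 2^-5 .. 2^15.
    for (int g = -15; g <= 3; g += 2) {
        for (int c = -5; c <= 15; c += 2) {
            double gamma = std::pow(2.0, g), C = std::pow(2.0, c);
            double acc = crossValidate(gamma, C);  // e.g. 5-fold CV accuracy
            if (acc > best.bestAccuracy)
                best = {gamma, C, acc};
        }
    }
    return best;
}
```

In practice one would refine the search with a finer grid around the best coarse-grid point.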
The model obtained after training was then used for recognition. The test set consists of 401 cars and 401 other objects; the training and test sets were taken randomly from different scanned streets. The best accuracy achieved so far on the test set is 90.7731% (728 correctly recognized objects out of 802).
As for descriptors, I used a combination of the RoPS feature and some global features, such as the height and width of the oriented bounding box. The RoPS feature was computed at the center of mass of the cloud, with a support radius large enough to include all points of the cloud.
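The support radius described above can be obtained directly from the cloud; a minimal sketch (the `Point` type is an illustrative stand-in for a PCL point):

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

using Point = std::array<double, 3>;

// Centre of mass of the cloud.
Point centerOfMass(const std::vector<Point>& cloud) {
    Point c{0.0, 0.0, 0.0};
    for (const Point& p : cloud)
        for (int i = 0; i < 3; ++i) c[i] += p[i];
    for (int i = 0; i < 3; ++i) c[i] /= cloud.size();
    return c;
}

// Smallest radius around the centre of mass that covers every point,
// i.e. a support radius "big enough to include all the points".
double supportRadius(const std::vector<Point>& cloud) {
    Point c = centerOfMass(cloud);
    double r = 0.0;
    for (const Point& p : cloud) {
        double dx = p[0] - c[0], dy = p[1] - c[1], dz = p[2] - c[2];
        r = std::max(r, std::sqrt(dx * dx + dy * dy + dz * dz));
    }
    return r;
}
```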
Since RoPS is better suited for local feature extraction, I believe that using it with ISM and Hough-transform voting will result in higher accuracy.
Hello everybody. I’d like to thank Yulan Guo, one of the authors of the RoPS feature, for his help. I’ve tested my implementation against his and got the same results. I have also checked my implementation for memory leaks with VLD, and it works fine; no memory leaks were detected. The code is now ready for commit, and, as always, I have written a tutorial on using it. All that is left is to discuss where to place the implemented code.
Hello everybody. I have finished implementing the RoPS feature. The next step is to write to the authors and ask them for some sample data and precomputed features, so that I can compare the results. After that I am planning to test the RoPS feature for object recognition, using the Implicit Shape Model algorithm from PCL.
We are pleased to report that labeling of the Castro dataset (6440 frames) is finished. Here are some examples of labeled images:
We also tested an algorithm developed by Alex Trevor in the previous HRI code sprint. This algorithm segments points based on their normals, which makes it very sensitive to noise.
Basically, this algorithm computes a disparity map for a stereo pair using its own dense matching method, implemented by Federico Tombari. But I additionally tested it using disparity maps precomputed by HRI. Here are the typical results (left: disparity computed with Federico Tombari’s method; right: precomputed by HRI):
You can see that Federico Tombari’s method is friendlier to the normal-based algorithm, but it is not good enough for describing the scene: there are a lot of false positives.
The HRI disparity maps contain some noise, and many pixels have no valid disparity; sometimes there are no segments similar to the road, and there are many frames in which the road was not found at all.
The algorithm also has disparity thresholds and does not mark as “road” any point that falls outside them. I did not take this into account, because it would make the results not fully comparable; therefore, a recall of 50% would be a very good result.
The goal is to find all pixels that belong to the road. The overall results are shown in the image below (precision is the ratio of correctly detected road points to all detected pixels; recall is the percentage of road points that were detected):
Hello everybody. I have finally finished the code for simple features. I’ve implemented a pcl::MomentOfInertiaEstimation class, which computes descriptors based on eccentricity and moment of inertia. This class can also extract axis-aligned and oriented bounding boxes of the cloud. Keep in mind, though, that the extracted OBB is not necessarily the minimal possible bounding box.
The idea of the feature extraction method is as follows. First, the covariance matrix of the point cloud is calculated, and its eigenvalues and eigenvectors are extracted. You can assume that the resulting eigenvectors are normalized and always form a right-handed coordinate system (the major eigenvector represents the X-axis and the minor eigenvector the Z-axis). Next, an iterative process takes place: on each iteration the major eigenvector is rotated. The rotation order is always the same and is performed around the other eigenvectors; this provides invariance to rotation of the point cloud. We will refer to this rotated major vector as the current axis.
For every current axis, the moment of inertia is calculated. The current axis is also used for the eccentricity calculation: it is treated as the normal vector of a plane, the input cloud is projected onto that plane, and the eccentricity of the obtained projection is calculated.
The implemented class also provides methods for getting the AABB and OBB. The oriented bounding box is computed as an AABB in the coordinate frame of the eigenvectors.
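The per-axis quantity computed at each iteration above can be sketched as follows: the moment of inertia of the cloud about a unit axis through its centroid. This is a self-contained illustration, not the pcl::MomentOfInertiaEstimation code itself:

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;

// Moment of inertia of a unit-mass point cloud about the given unit axis
// passing through the cloud's centroid.
double momentOfInertia(const std::vector<Vec3>& cloud, const Vec3& axis) {
    Vec3 c{0.0, 0.0, 0.0};
    for (const Vec3& p : cloud)
        for (int i = 0; i < 3; ++i) c[i] += p[i] / cloud.size();
    double m = 0.0;
    for (const Vec3& p : cloud) {
        Vec3 d{p[0] - c[0], p[1] - c[1], p[2] - c[2]};
        double along = d[0] * axis[0] + d[1] * axis[1] + d[2] * axis[2];
        // squared distance from the axis = |d|^2 - (d . axis)^2
        m += d[0] * d[0] + d[1] * d[1] + d[2] * d[2] - along * along;
    }
    return m;
}
```

Evaluating this for every rotated major eigenvector yields the moment-of-inertia part of the descriptor; the eccentricity part comes from projecting the cloud onto the plane normal to the same axis.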
At first, we decided to implement some simple features such as:
I’ve already started implementing them. After this step I will implement some more complex descriptors (e.g., 3D SURF and RoPS, Rotational Projection Statistics). Finally, I’m going to use machine learning methods for object recognition.
To evaluate the developed algorithm we need ground truth, so we decided to outsource manual labeling. For this purpose, a highly efficient tool was developed. A person can easily solve this task; however, there are some difficulties.
First of all, what should we do if there are several separate roads in the scene? Our solution is to mark as “road” only the pixels of the road on which the vehicle is driving. Below is an example of such a frame (left) and the labeling for it (right).
How should we label an image if two roads that were separate in the previous frames merge in the current frame? We decided to mark the pixels of the first road, plus those pixels of the second road lying above the horizontal line drawn through the end of the curb that separates the roads. Here is an example for explanation:
We have reduced manual labeling time by a factor of ten compared to our initial version, so we can now obtain enough labeled data in a reasonable time. All results will be made publicly available later.
The project “Fast 3D cluster recognition of pedestrians and cars in uncluttered scenes” has been started!
The project “Part-based 3D recognition of pedestrians and cars in cluttered scenes” has been started!
A few words about the project: the goal is to detect the drivable area (a continuously flat area bounded by a height gap such as a curb). As input we have two rectified images from the cameras on the car’s roof and a disparity map. An example of such images is below.
The point cloud computed from them:
The point cloud is converted into the Digital Elevation Map (DEM) format to simplify the task. The DEM is a grid in column-disparity space with a height associated with each node. A projection of the DEM onto the left image is illustrated below.
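The DEM structure described above can be sketched as follows. The grid dimensions, the choice of keeping the maximum height per cell, and the use of NaN for empty cells are illustrative assumptions, not the exact conventions of our implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// A grid over (image column, disparity) storing one height per cell.
struct Dem {
    int cols, dispBins;
    std::vector<double> height;  // NaN marks cells with no projected points
    Dem(int c, int d)
        : cols(c), dispBins(d),
          height(c * d, std::numeric_limits<double>::quiet_NaN()) {}
};

// Accumulate one stereo point into its (column, disparity) cell.
// Here we keep the maximum height per cell; averaging is another option.
void addPoint(Dem& dem, int colBin, int dispBin, double y) {
    double& h = dem.height[dispBin * dem.cols + colBin];
    h = std::isnan(h) ? y : std::max(h, y);
}
```

Cells that stay NaN are exactly the red nodes in the visualization below: grid positions onto which no disparity-map point projects.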
In the following image you can see that, despite the low resolution of the DEM, it is still possible to distinguish the road from the sidewalk.
A front view of the DEM in 3D space (nodes without corresponding points, i.e., nodes onto which no points of the disparity map project, are marked in red):
DEM as a point cloud:
As a starting point we are going to implement an algorithm: J. Siegemund, U. Franke, and W. Forstner, “A temporal filter approach for detection and reconstruction of curbs and road surfaces based on conditional random fields,” in Proc. IEEE Intelligent Vehicles Symp., 2011, pp. 637-642.
All source code related to the project can be found here.
The Honda Research Code Sprint for ground segmentation from stereo has been completed. PCL now includes tools for generating disparity images and point clouds from stereo data courtesy of Federico Tombari, as well as tools for segmenting a ground surface from such point clouds from myself. Attached is a report detailing the additions to PCL and the results, as well as a video overview of the project. There is a demo available in trunk apps, as pcl_stereo_ground_segmentation.
This week I committed the code to perform face detection in PCL trunk and wrote the final report summarizing the work done as well as how to use the module.
In the last months I have been working on a new meta-global descriptor called OUR-CVFH (http://rd.springer.com/chapter/10.1007/978-3-642-32717-9_12), which, as you can imagine, is an extension of CVFH, which is in turn an extension of VFH (I am not very original at dubbing things). I have also committed some tools and pipelines into pcl/apps/3d_rec_framework (still unstable and not very well documented).
Tomorrow we are having a TV demo in the lab where we are showing recent work on recognition/classification and grasping of unknown objects. So, as usually happens, I had to finish some things for it, and I would like to show how, with OUR-CVFH, it is possible to do scale-invariant recognition and 6DOF pose estimation plus scale. The training objects in this case were downloaded from 3d-net.org (unit scale, whatever that unit is), and they usually do not fit the test objects accurately.
Apart from this, I have also extended OUR-CVFH to use color information and integrate it into the histogram. Basically, the reference frame obtained in OUR-CVFH is used to create color distributions depending on the spatial distribution of the points. To test the extension, I ran some evaluations on the Willow Garage ICRA 11 Challenge dataset, obtaining excellent results (about 99% precision and recall). The training dataset is composed of 35 objects, and the test set contains 40 sequences totalling 435 object instances. A 3D recognition pipeline based on SIFT (keypoints projected to 3D) obtains about 70% on this dataset (even though the objects present texture most of the time). Combining SIFT with SHOT and merging the hypotheses gets about 84%, and the most recent paper on this dataset (Tang et al., ICRA 2012) obtains about 90% recall at 99% precision. If you are not familiar with the dataset, here are some screenshots with the recognition results and pose estimates overlaid.
The color extension to OUR-CVFH and the hypotheses-verification stage are not yet in PCL, but I hope to commit them as soon as possible, probably after the ICRA deadline and before ECCV. You can find the Willow ICRA challenge test dataset in PCD format at http://svn.pointclouds.org/data/ICRA_willow_challenge.
I am back from “holidays”, conferences, etc., and today I started dealing with some of the concerns I raised in my last post, mainly the memory footprint required for training. The easiest way to deal with it is to do bagging on each tree, so that the training samples used by a tree are loaded before training starts and discarded afterwards. I implemented this by adding an abstract DataProvider class to the random forest implementation, which is specialized depending on the problem. When a tree is trained and a data provider is available, the tree requests training data from the provider, trains, and discards the samples.
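The idea can be sketched as below. The names and interface are illustrative, not PCL’s actual random-forest API; the point is only that each tree’s bootstrap sample lives in memory just for the duration of that tree’s training:

```cpp
#include <cstddef>
#include <vector>

// Abstract source of per-tree training data, specialized per problem
// (e.g. loading and cropping Kinect frames for face training).
template <typename Sample>
class DataProvider {
public:
    virtual ~DataProvider() = default;
    // Load (or generate) the bootstrap sample for tree `treeIdx`.
    virtual std::vector<Sample> samplesForTree(int treeIdx) = 0;
};

// Train each tree on its own sample; the sample is freed when the
// loop iteration's scope ends, bounding peak memory to one tree's data.
template <typename Sample, typename Tree>
void trainForest(std::vector<Tree>& forest, DataProvider<Sample>& provider) {
    for (std::size_t t = 0; t < forest.size(); ++t) {
        std::vector<Sample> samples = provider.samplesForTree(static_cast<int>(t));
        forest[t].train(samples);  // samples discarded after this call
    }
}
```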
I also realized that most of the training data I have for faces contains a lot of NaNs, except for the parts containing the face itself and other parts of the body (which are usually located in the center of the image). So, to further reduce the data held in memory, the data-provider specialization crops the Kinect frames, discarding regions with only NaN values.
With these two simple tricks, I am able to train each tree in the forest on 2000 random training samples (from each sample, 10 positive and 10 negative patches are extracted) using only 3GB of RAM. In case more training data is needed or the training samples become bigger, one might use a similar trick to design an out-of-core implementation where data is requested not at the tree level but at the node level, with only indices kept in memory.
I also found some silly bugs and now I am retraining... let’s see what comes out.
I have continued working on the face detection method and added the pose estimation part, including the clustering step mentioned in my last post. See the video for some results from our implementation (at the beginning it is a bit slow due to the video recording software; then it gets better).
I have fixed several bugs lately, and even though the results are starting to look pretty good, I am not yet completely satisfied. First, I was facing some problems during training regarding what to do with patches where the features are invalid (division by zero). I ended up using a tree with three branches, which worked better, although I am not yet sure which classification measure should then be used (working on that). The other items are: using normal features, which can be computed very quickly on organized data with newer PCL versions, and modifying the way the random forest is trained. Right now it requires all training data to be available in memory (one integral image for each training frame, or even four of them if normals are used). This ends up taking a lot of RAM and restricts the amount of training data that can be used to train the forest.
It has been some time since my last post. I was on vacation for some days, then sick, and afterwards catching up on everything after the inactivity. Anyway, I have resumed work on head detection and pose estimation, reimplementing the approach from Fanelli at ETH. I implemented the regression part of the approach, so that the trees provide information about the head location and orientation, and made some improvements to the previous code. I used the purity criterion to activate regression, which seemed the most straightforward choice.
The red spheres show the predicted head locations after filtering out sliding windows that reach leaves with high variance and are therefore not accurate. As you can see, there are several red spheres at non-head locations. Nevertheless, the approach relies on a final bottom-up clustering to isolate the different heads in the image; thresholding on cluster size then removes such outliers.
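A minimal sketch of such a clustering step, grouping detections by distance and keeping only clusters with enough votes (the greedy strategy, the radius, and the minimum size are illustrative assumptions, not the values in our code):

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using P3 = std::array<double, 3>;

// Greedily group detections within `radius` of a seed; clusters smaller
// than `minSize` are treated as outliers and dropped.
std::vector<std::vector<P3>> clusterDetections(std::vector<P3> dets,
                                               double radius,
                                               std::size_t minSize) {
    std::vector<std::vector<P3>> clusters;
    while (!dets.empty()) {
        P3 seed = dets.back();
        std::vector<P3> cluster, rest;
        for (const P3& p : dets) {
            double dx = p[0] - seed[0], dy = p[1] - seed[1], dz = p[2] - seed[2];
            (std::sqrt(dx * dx + dy * dy + dz * dz) <= radius ? cluster : rest)
                .push_back(p);
        }
        dets.swap(rest);
        if (cluster.size() >= minSize)  // small clusters are outliers
            clusters.push_back(cluster);
    }
    return clusters;
}
```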
I hope to commit a working (and complete) version quite soon together with lots of other stuff regarding object recognition.
I recently received stereo data and disparity maps to work with for this project, so I wrote a tool to convert the disparity maps to PCD files. The provided disparity data has been smoothed somewhat, which I think might be problematic for our application. For this reason, I also produced disparities using OpenCV’s semi-global block matching algorithm, which produces quite different results. You can see an example here:
Above is the left image of the input scene. Note the car in the foreground, the curb, and the more distant car on the left of the image.
Above is a top-down view of a point cloud generated by OpenCV’s semi-global block matching. The cars and curb are visible, though there is quite a bit of noise.
Above is an image using the provided disparities, which included some smoothing. The curb is no longer visible, and there is also an odd “ridge” in the ground plane starting at the front of the car. I think this will be problematic for ground plane segmentation. Both approaches seem to have advantages and disadvantages, so I’ll keep both sets of PCDs around for testing. Now that I have PCD files to work with, I’m looking forward to using them with my segmentation approach. Prior to using stereo data, I developed the segmentation for use on Kinect data. I think the main challenge in applying this approach to stereo data will be dealing with the reduced point density and greatly increased noise. I’ll post more on this next time.
Because of ICRA and the preparations for the conference, I was quite inactive for the last couple of weeks. This week I had to finish some school projects, and yesterday I resumed work on face detection. Because the BoW approach using SHOT was “slow”, I decided to give decision forests a try, with features extracted directly from the depth image (similar to http://www.vision.ee.ethz.ch/~gfanelli/head_pose/head_forest.html). Although not yet finished, I am already able to generate face responses over the depth map, and the results look like this:
Basically, the map accumulates, at each pixel, how many sliding windows with a probability of being a face higher than 0.9 include that pixel. This runs in real time thanks to the use of integral images for evaluating the features. For the machine learning part, I am using the decision forest implementation available in the ML module of PCL.
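The accumulation itself can be sketched as follows. The square window, the probability layout, and the brute-force inner loops are simplifications for illustration; the real code evaluates the forest’s features via integral images:

```cpp
#include <vector>

// Every sliding window whose face probability exceeds 0.9 votes for all
// pixels it covers; the result is the per-pixel response map.
// prob[y][x] = classifier probability for the window with top-left (x, y).
std::vector<int> faceResponseMap(int width, int height, int win,
                                 const std::vector<std::vector<double>>& prob) {
    std::vector<int> votes(width * height, 0);
    for (int y = 0; y + win <= height; ++y)
        for (int x = 0; x + win <= width; ++x) {
            if (prob[y][x] <= 0.9) continue;  // keep only confident windows
            for (int dy = 0; dy < win; ++dy)
                for (int dx = 0; dx < win; ++dx)
                    ++votes[(y + dy) * width + (x + dx)];
        }
    return votes;
}
```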
Here are some results regarding 3D face detection:
To speed things up, I switched to normal estimation based on integral images (20fps). I found and fixed some bugs in the OMP version of SHOT (the SHOT description was being OMP’ed but the reference frame computation was not), and now I am getting almost 1fps for the whole recognition pipeline. I was not convinced by the results with RNN, so I decided to give k-means a try (no conclusions yet).
The code is far from perfect (there are a few hacks here and there), but the results are starting to look decent. Regarding further speed-ups, we should consider moving to the GPU; however, porting SHOT might not be the lightest task. The next steps are stabilizing the code and doing some commits; then we will see whether we go for the GPU or move forward to pose detection and hypotheses verification.
Yesterday and today I have been working on face detection and went ahead implementing a Bag of Words approach. I will briefly summarize the steps I followed:
Right now I am not doing any post-processing, just visualizing the first 5 candidates (the sliding windows with the highest similarity). You can see some results in the following images. Red points indicate computed features (that are not found in the visualized candidates), and green spheres indicate those that vote for a face within the candidate list. The next step will be some optimizations to speed up detection: it currently takes between 3 and 6 seconds, depending on the amount of data to be processed (points far from the camera are ignored and the cloud is downsampled).
This week I have been working on a publication, which took up most of my time. However, Federico and I found some time to chat about face detection. We decided to try, next week, a 3D feature classification approach based on a bag-of-words model. Such an approach should be able to deal gracefully with intra-class variations and deliver regions of interest with a high probability of containing faces, on which we can focus the more expensive recognition and pose estimation stages.
Hi again. First, I need to correct myself regarding my last post, where I claimed that this week I would be working on face detection. The experiments and tests I was doing on CVFH ended up taking most of my time, but the good news is that I am getting good results, and the changes have increased the descriptiveness and robustness of the descriptor.
Mainly, a unique and repeatable coordinate frame is built at each CVFH smooth cluster of an object (I also slightly modified how the clusters are computed), enabling a spatial description of the object with respect to this coordinate frame. The other good news is that this coordinate frame is also repeatable under roll rotations and thus can replace the camera roll histogram, which in some situations was not accurate or discriminative enough, yielding several roll hypotheses that needed further post-processing and inevitably slowed down recognition.
These are some results using the first 10 nearest neighbours, pose refinement with ICP, and hypotheses verification using the greedy approach. The recognition time varies between 500ms and 1s per object, where approximately 70% of the time is spent on pose refinement. The training set contains 16 objects.
Here are a couple of scenes without the pose refinement stage, where it can be observed that the pose obtained by aligning the reference frame is accurate enough for the hypotheses verification to select a good hypothesis. In this case, the recognition time varies between 100ms and 300ms per object.
I am pretty enthusiastic about the modifications and believe that with some GPU optimizations (mainly of the nearest-neighbour searches for ICP and hypotheses verification) an (almost) real-time version could be implemented.
Regarding the local pipeline, I implemented a new training data source for registered views obtained with a depth device. In this case, the local pipeline can be used as usual without needing 3D meshes of the objects to recognize. The input is represented as PCD files (segmented views of the object) together with a transformation matrix that aligns each view to a common coordinate frame. This makes it easy to train on objects in our environment (Kinect plus a calibration pattern) and allows the use of RGB/texture cues (if available from the sensor) that were not available when using 3D meshes. The next image shows an example of a quick experiment in which four objects were scanned from different viewpoints using a Kinect and placed into a scene with some clutter in order to be recognized.
The red points represent the overlaid model after being recognized using SHOT, geometric correspondence grouping, SVD, ICP, and Papazov’s verification. The downside of not having a 3D mesh is that the results do not look so pretty :) Notice that such input could also be used to train the global pipeline. Anyway, I will be doing a “massive” commit later next week with all these modifications. GPU optimizations will be postponed for a while, but help is welcome after the commits.
This last week I have continued working on the recognition framework, focusing on the global pipeline. The global pipeline requires segmentation to hypothesize about objects in the scene; each object is then encoded using a global feature (currently available in PCL are VFH, CVFH, ESF, ...) and matched against a training set whose objects (their partial views) have been encoded using the same feature. The candidates obtained from the matching stage are post-processed with the Camera Roll Histogram (CRH) to obtain a full 6DOF pose. Finally, the pose can be refined and the best candidate selected by means of a hypotheses-verification stage. I will also integrate Alex’s work on real-time segmentation and Euclidean clustering into the global pipeline (see http://www.pointclouds.org/news/new-object-segmentation-algorithms.html).
In summary, I committed the following things to PCL:
These are some results using CVFH, CRH, ICP and the greedy hypotheses verification:
I have also been playing a bit with CVFH to resolve some mirror invariances and, in general, increase the descriptive power of the descriptor. The main challenge so far has been finding a semi-global, unique, and repeatable reference frame. I hope to finish this extension at the beginning of next week and clean up the global pipeline so I can commit it. Regarding the main topic of the sprint, we will try some fast depth-based face detectors to efficiently retrieve regions of interest with a high probability of containing faces. Another interesting approach that we will definitely try can be found here: http://www.vision.ee.ethz.ch/~gfanelli/head_pose/head_forest.html
Hi again. I integrated the generation of training data for faces into the recognition framework and used the standard recognition pipeline based on SHOT features, geometric consistency grouping plus RANSAC, SVD to estimate the 6DOF pose, and the hypotheses verification from Papazov. The results are pretty cool and encouraging...
That’s me in the first image (I should go to the hairdresser...), and the next image is Hannes, a colleague from our lab.
The CAD model used in this case was obtained from http://face.turbosquid.com/, which offers some free 3D meshes of faces. Observe that despite the training geometry being slightly different from that of the recognized subjects, the model is aligned quite well to the actual face. Notice also the amount of noise in the first image.
I am having interesting conversations with Radu and Federico about how to proceed, so I will post a new entry soon with a concrete roadmap.
Last week I worked on a small framework for 3D object recognition/classification. It can be found in trunk under apps/3d_rec_framework, but be aware that it is far from finished.
This is related to the 3D face orientation project, as face detection and orientation estimation can be approached as a classic object recognition problem: a training set of the objects to be detected (faces, in this case) is available, and salient features are computed on the training data. During recognition, the same features are computed on the input depth image or point cloud and matched against the training features, yielding point-to-point correspondences from which a 3D pose can be estimated, usually by means of RANSAC-like approaches.
I am a big fan of using 3D meshes or CAD models for object recognition, for many reasons, so I decided to do a small experiment on this for face detection. I took a random mesh of a face from the Princeton Shape Benchmark and aligned it as depicted in the first image. Yaw, pitch, and roll are usually used to define a coordinate system for faces.
Because we would like to recognize a face from several orientations, the next step consists in simulating how the mesh would look when seen from different viewpoints with a depth sensor. Basically, we can discretize the yaw-pitch-roll space and render the mesh from the same viewpoint after transforming it with the corresponding yaw, pitch, and roll rotations.
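The discretization step can be sketched as follows; the ranges match the [-45°, 45°] / 10° sampling used in the renderings below, while fixing roll is a simplification for brevity (in general all three angles can be sampled the same way):

```cpp
#include <array>
#include <vector>

// Enumerate (yaw, pitch, roll) triples over [-range, range] degrees with the
// given step; each triple defines one synthetic rendering viewpoint.
std::vector<std::array<double, 3>> sampleViews(double range, double step) {
    std::vector<std::array<double, 3>> views;  // each entry: {yaw, pitch, roll}
    for (double yaw = -range; yaw <= range; yaw += step)
        for (double pitch = -range; pitch <= range; pitch += step)
            views.push_back({yaw, pitch, 0.0});  // roll fixed for brevity
    return views;
}
```

Each triple is then turned into a rotation of the mesh before reading back the rendered depth buffer as a partial view.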
The small window with the red background is a VTK window used to render the mesh; the point cloud (red points overlapping the mesh) is obtained by reading VTK’s depth buffer. The partial view shown is obtained with a yaw of -20° and a pitch of 10°. The next image is a screenshot of a multi-viewport display when varying the yaw over the range [-45°, 45°].
And, maybe more interestingly, varying the pitch over [-45°, 45°] with a 10° step. The points are colored according to their z-value.
Basically, this makes it easy to generate training data with known pose/orientation information, which represents an interesting opportunity for our task. The idea would be to have a big training dataset so that the variability among faces (scale, traits, ...) is captured. The same would apply to tracking applications where a single person is to be tracked: a mesh of the face could be generated in real time (using KinFu) and used as the only input for the recognizer. This is probably the next thing I am going to try, using FPFH or SHOT for the feature matching stage.