Project: 3D Face Orientation
Mentors: Federico Tombari and Radu Bogdan Rusu
I am currently a PhD candidate at the Vision4Robotics group (Technical University of Vienna). My research interests include object recognition/classification, pose estimation/alignment, and perception applications for robotics such as grasping.
This week I committed the code to perform face detection to PCL trunk and wrote the final report summarizing the work done and explaining how to use the module.
Over the last months I have been working on a new meta-global descriptor called OUR-CVFH (http://rd.springer.com/chapter/10.1007/978-3-642-32717-9_12), which, as you can imagine, is an extension of CVFH, which is in turn an extension of VFH (I am not very original at dubbing things). I have also committed some tools and pipelines into pcl/apps/3d_rec_framework (still unstable and not very well documented).
Tomorrow we are having a TV demo in the lab where we are showing recent work on recognition/classification and grasping of unknown objects. So, as usually happens, I had to finish some things for it, and I would like to show how OUR-CVFH makes it possible to do scale-invariant recognition and 6DOF pose estimation plus scale. The training objects are in this case downloaded from 3d-net.org (unit scale, whatever that unit is) and usually do not fit the test objects accurately.
Apart from this, I have also extended OUR-CVFH to use color information and integrate it into the histogram. Basically, the reference frame obtained in OUR-CVFH is used to create color distributions depending on the spatial distribution of the points. To test the extension, I ran some evaluations on the Willow Garage ICRA 11 Challenge dataset, obtaining excellent results (about 99% precision and recall). The training dataset is composed of 35 objects, and the test set consists of 40 sequences totalling 435 object instances. A 3D recognition pipeline based on SIFT (keypoints projected to 3D) obtains about 70% on this dataset (even though the objects present texture most of the time). Combining SIFT with SHOT and merging the hypotheses together gets about 84%, and the most recent paper on this dataset (Tang et al., ICRA 2012) obtains about 90% recall at 99% precision. If you are not familiar with the dataset, here are some screenshots with the respective recognition and pose estimation results overlaid.
The color extension to OUR-CVFH and the hypothesis verification stage are not yet in PCL, but I hope to commit them as soon as possible, probably after the ICRA deadline and before ECCV. You can find the Willow ICRA challenge test dataset in PCD format at http://svn.pointclouds.org/data/ICRA_willow_challenge.
I am back from “holidays”, conferences, etc., and today I started dealing with some of the concerns I pointed out in the last email, mainly regarding the memory footprint required for training. The easiest way to deal with that is to do bagging on each tree, so that the training samples used by a tree are loaded right before training starts and discarded once that tree is trained. I implemented this by adding an abstract DataProvider class to the random forest implementation, which is specialized depending on the problem. Then, when a tree is trained and a data provider is available, the tree requests training data from the provider, trains, and discards the samples.
I also realized that most of the training data I have for faces contains a lot of NaNs, except for the parts containing the face itself and other parts of the body (which are usually located in the center of the image). So, to further reduce the data held in memory, the specialization of the data provider crops the Kinect frames, discarding regions with only NaN values.
With these two simple tricks, I am able to train each tree in the forest with 2000 random training samples (from each sample, 10 positive and 10 negative patches are extracted) requiring only 3GB of RAM. If more training data is needed or the training samples become bigger, one might use a similar trick to design an out-of-core implementation where the data is requested not at tree level but at node level, and only indices are kept in memory.
I also found some silly bugs and now I am retraining... let’s see what comes out.
I have continued working on the face detection method and added the pose estimation part, including the clustering step mentioned in my last post. See the video for some results from our implementation (at the beginning it is a bit slow due to the video recording software; then it gets better).
I have fixed several bugs lately, and even though the results are starting to look pretty good, I am not yet completely satisfied. First, I was facing some problems during training regarding what to do with patches where the features are invalid (division by zero). I ended up using a tree with three branches, which worked better, although I am not yet sure which classification measure should be used in that case (working on that). The other pending items are: the use of normal features, which can be computed very fast on organized data with newer PCL versions, and a modification to the way the random forest is trained. Right now it requires all training data to be available in memory (one integral image for each training frame, or even four of them if normals are used). This ends up taking a lot of RAM and restricts the amount of training data that can be used to train the forest.
It has been some time since my last post. I was on vacation for some days, then sick, and afterwards catching up on everything after the inactivity. Anyway, I have resumed work on head detection + pose estimation, reimplementing the approach from Fanelli et al. at ETH. I implemented the regression part of the approach, so that the trees provide information about the head location and orientation, and made some improvements to the previous code. I used the purity criterion to decide when to activate regression, which seemed the most straightforward option.
The red spheres show the predicted head locations after filtering out sliding windows that reach leaves with high variance and are therefore not accurate. As you can see, there are several red spheres at non-head locations. Nevertheless, the approach relies on a final bottom-up clustering to isolate the different heads in the image. The size of the clusters allows thresholding head detections and eventually removing outliers.
I hope to commit a working (and complete) version quite soon together with lots of other stuff regarding object recognition.