Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation
We propose a framework for learning a hierarchical shape vocabulary for multi-class object representation. The vocabulary is compositional: each shape feature in the hierarchy is composed of simpler ones by means of spatial relations. Learning is statistical and is performed bottom-up. The approach takes simple oriented contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly complex and class-specific shape compositions, each accommodating a high degree of shape variability. At the top level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer by layer, gradually increasing the size of the window of analysis and the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. We assume supervision in the form of a positive and a validation set of class images; however, the hierarchical structure of each class is learned in a completely unsupervised way (no labels on object parts or smaller constituents are assumed).
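The core bottom-up learning step can be sketched as counting co-occurrences of part detections at quantized relative positions and keeping the frequent ones as candidate compositions for the next layer. The sketch below is a hypothetical simplification, not the paper's exact procedure; the function name, the (part_id, x, y) detection format, and all thresholds are assumptions:

```python
from collections import Counter

def learn_compositions(detections, radius=8, bin_size=4, min_count=2):
    """One bottom-up learning step (hypothetical simplification).

    `detections` is a list with one entry per image, each a list of
    (part_id, x, y) tuples for the current layer's part detections.
    We count co-occurring part pairs at quantized relative offsets and
    keep frequent configurations as candidate next-layer compositions.
    """
    counts = Counter()
    for parts in detections:
        for pid, x, y in parts:
            for qid, u, v in parts:
                dx, dy = u - x, v - y
                # skip self-pairs and pairs outside the window of analysis
                if (dx, dy) == (0, 0) or abs(dx) > radius or abs(dy) > radius:
                    continue
                # quantize the spatial relation into coarse bins
                key = (pid, qid, dx // bin_size, dy // bin_size)
                counts[key] += 1
    return [cfg for cfg, c in counts.items() if c >= min_count]
```

In the full approach this step is applied recursively, with the window of analysis (here the fixed `radius`) growing from layer to layer as described above.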
Training the representation for an individual class takes 20–25 minutes on average. When multiple classes are learned incrementally, the training time for each additional class decreases.
Examples of vocabulary shapes (excluding the fixed Layer 1) learned on 1500 natural images. Only the means of the shape models are depicted:
Examples of shape models at layers 4 and 5 and the final object layer, learned on 15 object classes for detection:
Examples of the complete learned whole-object shape models (with the learned spatial relations also shown):
The features learned at layers 2 and 3 can be combined with a linear SVM for object classification. We report results on the Caltech-101 dataset and compare against other hierarchical approaches:
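One way to turn the layer-2/3 feature activations into fixed-length SVM inputs is spatial histogram pooling. The sketch below is an assumption for illustration, not the paper's exact pooling scheme; the grid size, image dimensions, and (part_id, x, y) detection format are all hypothetical:

```python
def shape_histogram(detections, n_types, grid=2, img_w=100, img_h=100):
    """Pool part detections into a spatial histogram (hypothetical sketch).

    Produces one bin per part type per grid cell, yielding a fixed-length
    vector per image that can be fed to a linear classifier.
    """
    hist = [0] * (n_types * grid * grid)
    for pid, x, y in detections:
        # locate the grid cell containing this detection
        cx = min(int(x * grid / img_w), grid - 1)
        cy = min(int(y * grid / img_h), grid - 1)
        hist[(cy * grid + cx) * n_types + pid] += 1
    return hist
```

Vectors built this way for all training images could then be passed to a linear SVM, e.g. scikit-learn's `LinearSVC`.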
Object class detection - inference times, performance
Matching a vocabulary learned for an individual class takes 2–4 seconds per image, depending on the image size (roughly 700×500 on average in our experiments) and the amount of texture it contains. Matching the joint vocabulary of 15 object classes takes only 16–20 seconds per image.
Performance for object class detection:
We observe sub-linear, approximately logarithmic scaling of detection complexity with the number of classes, owing to feature sharing across classes:
Examples of detections of several object classes: