Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation

Introduction

We propose a framework for learning a hierarchical shape vocabulary for multi-class object representation. The vocabulary is compositional: each shape feature in the hierarchy is composed out of simpler ones by means of spatial relations. Learning is statistical and is performed bottom-up. The approach takes simple oriented contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly complex and class-specific shape compositions, each exhibiting a high degree of shape variability. At the top level of the vocabulary, the compositions are sufficiently large and complex to represent the whole shapes of the objects. We learn the vocabulary layer after layer, by gradually increasing the size of the window of analysis and the spatial resolution at which the shape configurations are learned. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. We assume supervision in the form of a positive training set and a validation set of class images; however, the hierarchical structure of each class is learned in a completely unsupervised way (no labels on object parts or smaller constituents are assumed).
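The bottom-up statistics of the first learning step can be illustrated with a minimal sketch. The function below is a hypothetical simplification (the function name, window size, and binning are illustrative, not the system's actual parameters): it counts how often pairs of oriented contour fragments co-occur at a quantised relative offset, and keeps the frequent pairs as candidate layer-2 compositions.

```python
import numpy as np
from collections import Counter

def frequent_pairs(detections, radius=5, n_bins=8, min_count=50):
    """detections: one list per image of (x, y, orientation_id) tuples.
    Returns a Counter over (ori_a, ori_b, dx_bin, dy_bin) keys that
    occur at least min_count times across the image set."""
    counts = Counter()
    for dets in detections:
        dets = np.asarray(dets)
        for x, y, ori in dets:
            # neighbours inside the window of analysis
            near = dets[(np.abs(dets[:, 0] - x) <= radius) &
                        (np.abs(dets[:, 1] - y) <= radius)]
            for nx, ny, nori in near:
                if (nx, ny) == (x, y):
                    continue
                # coarsely quantise the relative position
                # (this plays the role of the learned spatial relation)
                dx = int((nx - x + radius) * n_bins // (2 * radius + 1))
                dy = int((ny - y + radius) * n_bins // (2 * radius + 1))
                counts[(int(ori), int(nori), dx, dy)] += 1
    return Counter({k: v for k, v in counts.items() if v >= min_count})
```

In the actual framework the frequent configurations are then modelled statistically and the procedure is applied recursively, with a larger window and coarser resolution at each layer.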

File:lhop-hierarchy.jpg

Learning results

Training the representation for an individual class takes 20–25 minutes on average. When learning multiple classes incrementally, the training time for each additional class decreases.

Examples of vocabulary shapes (with the exception of the fixed Layer 1) learned on 1500 natural images. Only the means of the shape models are depicted:

File:lhop-vocabulary_ly1-3.jpg

Examples of shape models at layers 4, 5, and the final object layer learned for 15 classes for object detection:

File:lhop-vocabulary_ly4-6.jpg

Examples of the complete learned whole-object shape models (with the learned spatial relations also shown):

File:lhop-examples-tree.jpg

Object classification

The features learned at layers 2 and 3 can be combined with a linear SVM for object classification. We report results on the Caltech-101 dataset and compare with other hierarchical approaches:
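One common way to feed such vocabulary features to a linear SVM is to summarise each image as a normalised occurrence histogram over the learned features. The sketch below assumes this histogram representation (the function name and normalisation choice are ours, not necessarily the paper's exact pipeline):

```python
import numpy as np

def histogram_features(activations, vocab_size):
    """activations: iterable of layer-2/3 feature ids detected in one image.
    Returns an L1-normalised occurrence histogram, one entry per
    vocabulary feature, suitable as input to a linear classifier."""
    h = np.bincount(np.asarray(activations, dtype=int),
                    minlength=vocab_size).astype(float)
    s = h.sum()
    return h / s if s > 0 else h
```

The resulting vectors can then be passed to any linear SVM implementation (e.g. scikit-learn's `LinearSVC`) for training and classification.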

File:lhop-caltech-rate.jpg

Object class detection - inference times, performance

Matching a vocabulary learned for an individual class takes 2–4 seconds per image, depending on the size of the image (roughly 700×500 on average in our experiments) and the amount of texture it contains. Matching the joint vocabulary of all 15 object classes takes only 16–20 seconds per image.

Performance for object class detection:

File:lhop-object-det-perf.jpg

Detection time scales sub-linearly, roughly logarithmically, with the number of classes:
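The intuition behind this sub-linear scaling is feature sharing: classes reuse lower-layer compositions, so the cost of matching a joint vocabulary grows with the number of *distinct* compositions rather than with the number of classes. The toy model below is our own illustration of this effect (part counts and the sharing pool are made up, not measured from the system):

```python
import random

def joint_cost(class_vocabs):
    """Each vocabulary is a set of part ids; with sharing, the matching
    cost is proportional to the number of distinct parts."""
    parts = set()
    for v in class_vocabs:
        parts |= v
    return len(parts)

# 15 classes, each using 40 parts drawn from a common pool of 100
random.seed(0)
vocabs = [set(random.sample(range(100), 40)) for _ in range(15)]

independent = sum(len(v) for v in vocabs)  # cost if matched per class: 600
shared = joint_cost(vocabs)                # cost with sharing: at most 100
```

As more classes are added, `independent` grows linearly while `shared` saturates at the pool size, which is the qualitative behaviour observed in the scaling plot.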

File:lhop-scaling.jpg

Examples of detections of several object classes:

File:lhop-detection-examples.jpg

Publications

Publications on the topic of Learning a Hierarchy of Parts.