We propose a framework for learning a hierarchical shape vocabulary for multi-class object representation. The vocabulary is compositional: each shape feature in the hierarchy is composed of simpler ones by means of spatial relations. Learning is statistical and is performed bottom-up. Inspired by the principles of efficient indexing, robust matching, and compositionality, our approach learns a hierarchy of spatially flexible compositions, i.e. parts, in an unsupervised, statistics-driven manner. Starting with simple, frequent features, we learn the statistically most significant compositions (parts composed of parts), which in turn define the next layer. Parts are learned sequentially, layer after layer, optimally adjusting to the visual data. Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. Higher layers of the hierarchy, on the other hand, are constructed using specific categories, achieving a category representation with a small number of highly generalizable parts that gain their structural flexibility through composition within the hierarchy. With the hierarchy built in this way, new categories can be added to the system efficiently and continuously by adding only a small number of parts in the higher layers. The approach is demonstrated on a large collection of images and a variety of object categories.
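To make the bottom-up learning step concrete, the following is a minimal toy sketch of how one layer could be derived from the layer below it. It is not the method of this paper: raw co-occurrence frequency stands in for statistical significance, compositions are restricted to pairs of parts, and spatial flexibility is approximated by quantizing the relative offset. The function `learn_next_layer`, its parameters `radius`, `bin_size`, and `top_k`, and the detection format `(part_id, x, y)` are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def learn_next_layer(images, radius=3, bin_size=1, top_k=2):
    """Toy stand-in for one learning step: keep the top_k most frequent
    pairwise compositions of nearby parts (assumption: frequency is used
    in place of a proper statistical significance measure)."""
    counts = Counter()
    for detections in images:
        # Sorting gives every unordered pair of detections one canonical order.
        for (pa, xa, ya), (pb, xb, yb) in combinations(sorted(detections), 2):
            dx, dy = xb - xa, yb - ya
            if abs(dx) <= radius and abs(dy) <= radius:
                # Quantizing the offset lets slightly shifted configurations
                # fall into the same composition (a crude form of flexibility).
                key = (pa, pb, dx // bin_size, dy // bin_size)
                counts[key] += 1
    return [composition for composition, _ in counts.most_common(top_k)]


# Hypothetical layer-1 detections: each image is a list of (part_id, x, y).
toy_images = [
    [(0, 0, 0), (1, 1, 0)],
    [(0, 5, 5), (1, 6, 5)],
    [(0, 0, 0), (2, 0, 2)],
]

# Part 0 with part 1 one step to its right occurs twice, so it is selected.
layer2 = learn_next_layer(toy_images, top_k=1)
```

Applying the same function to detections of the selected compositions would yield the next layer, which mirrors the layer-after-layer procedure described above in a highly simplified form.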
Examples of the learned compositions in each layer, except for the first layer, which has a fixed set of parts.
Examples of the detections obtained from the highest layer of the hierarchical compositional model.