Abstract
We extend a recently developed method [1] for
learning the semantics of image databases using text and
pictures. We incorporate statistical natural language
processing in order to deal with free text. We demonstrate
the current system on a difficult dataset, namely 10,000
images of work from the Fine Arts Museum of San
Francisco. The images include line drawings, paintings,
and pictures of sculpture and ceramics. Many of the
images have associated free text whose nature varies greatly,
from physical description to interpretation and mood.
We use WordNet to provide semantic grouping
information and to help disambiguate word senses, as
well as emphasize the hierarchical nature of semantic
relationships. This allows us to impose a natural structure
on the image collection that reflects semantics to a
considerable degree. Our method produces a joint
probability distribution for words and picture elements.
We demonstrate that this distribution can be used (a) to
provide illustrations for given captions and (b) to
generate words for images outside the training set.
Results from this annotation process provide a quantitative
evaluation of our method. Finally, our annotation process can
be seen as a form of object recognizer that has been
learned through a partially supervised process.
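
As a concrete illustration of the semantic grouping and hypernym
hierarchy referred to above, the following minimal sketch lists
WordNet senses and hypernym paths. The use of NLTK's WordNet
interface and the example word are assumptions made purely for
illustration; they are not part of the system described here.

    # Minimal sketch: WordNet sense inventory and hypernym hierarchy.
    # Assumes NLTK with the WordNet corpus installed
    # (nltk.download('wordnet')); the system described above used
    # WordNet directly, not through NLTK.
    from nltk.corpus import wordnet as wn

    # A word usually has several senses (synsets); word sense
    # disambiguation means choosing among them.
    for synset in wn.synsets('horse', pos=wn.NOUN):
        print(synset.name(), '-', synset.definition())

    # Each sense sits in a hierarchy; its hypernym paths run from
    # the root ('entity') down to the sense itself.
    for path in wn.synset('horse.n.01').hypernym_paths():
        print(' -> '.join(s.name() for s in path))
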
1. Introduction
It is a remarkable fact that, while text and images are
separately ambiguous, jointly they tend not to be; this is
probably because the writers of text descriptions of
images tend to leave out what is visually obvious (the
colour of flowers, etc.) and to mention properties that are
very difficult to infer using vision (the species of the
flower, say). We exploit this phenomenon and extend a
method for organizing image databases using both image
features and associated text ([1], using a probabilistic
model due to Hofmann [2]).
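In outline, such a model couples a word w and an image region b
through shared latent factors. As a simplified, flat sketch (the
model in [1] is hierarchical; the latent variable l below is an
illustrative stand-in for its cluster and node structure):

    P(w, b) = \sum_{l} P(l) \, P(w \mid l) \, P(b \mid l)

Once fitted, such a distribution can be conditioned either way:
predicting words for a new image via P(w | b), or retrieving images
for text via P(b | w).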
By integrating the two kinds of information during model
construction, the system
learns links between image features and semantics that can
be exploited for better browsing (§3.1), better
search (§