Exploiting Structural Information for Text
Classification on the WWW
Johannes Fürnkranz
Austrian Research Institute for Artificial Intelligence
Schottengasse 3, A-1010 Wien, Austria
juffi@ai.univie.ac.at
Abstract. In this paper, we report on a set of experiments that ex-
plore the utility of making use of the structural information of WWW
documents. Our working hypothesis is that it is often easier to clas-
sify a hypertext page using information provided on pages that point to
it instead of using information that is provided on the page itself. We
present experimental evidence that confirms this hypothesis on a set of
Web-pages that relate to Computer Science Departments.
1 Introduction
The advent of the World-Wide Web has rejuvenated interest in text
categorization problems. Vast amounts of documents are available on-line, and
categorizing them into meaningful semantic categories is a rewarding and
challenging research problem.
However, current approaches to text categorization on the Web mostly
concentrate on simple representation schemes that are based on word occurrence
and word frequency. The structural information that is inherent to documents
on the Web is often neglected. There are at least two different kinds of
structural information on the Web that could be used to enhance the
performance of current text classification algorithms:
- the structure of the HTML representation, which makes it easy to identify
important parts of a document, such as its headings and its title, and
- the structure of the Web itself, where pages are linked to each other in
various ways.
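To make the two kinds of structural information concrete, the following minimal sketch shows how both could be read off a single HTML page: the title and headings (document structure) and the outgoing links together with their anchor text (Web structure). It is an illustration only, not the procedure used in our experiments; the names StructureExtractor and extract_structure are hypothetical, and the sketch assumes Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Tags whose text marks "important parts" of a document (illustrative choice).
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureExtractor(HTMLParser):
    """Hypothetical helper: collects title, headings, and links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []   # text of each heading element
        self.links = []      # (href, anchor text) pairs
        self._open = {}      # open tag of interest -> collected text fragments
        self._href = None    # href of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag == "a" or tag in HEADINGS:
            self._open[tag] = []
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text belongs to every element of interest that is currently open.
        for fragments in self._open.values():
            fragments.append(data)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return
        # Join the fragments and normalize whitespace.
        text = " ".join("".join(self._open.pop(tag)).split())
        if tag == "title":
            self.title = text
        elif tag == "a":
            if self._href is not None:
                self.links.append((self._href, text))
            self._href = None
        else:
            self.headings.append(text)


def extract_structure(html):
    """Parse one page and return the populated extractor."""
    extractor = StructureExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor
```

Applied to every page of a collection, the (href, anchor text) pairs can be regrouped by target URL, which yields for each page the text that other pages use when pointing to it.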
In this paper, we report on a set of experiments that explores the utility of
such structural information. Our working hypothesis is that (at least in some
domains) it is easier to classify hypertext pages using information provided on
pages that point to a page instead of using information that is provided on the
page itself. There are several reasons for this:
Redundancy: Quite often there is more than one page pointing to a single
pag