Automatic Text Categorization using the Importance of Sentences
Youngjoong Ko, Jinwoo Park, and Jungyun Seo
Department of Computer Science,
Sogang University
1 Sinsu-dong, Mapo-gu
Seoul, 121-742, Korea
{kyj,jwpark}@nlpzodiac.sogang.ac.kr, seojy@ccs.sogang.ac.kr
Abstract
Automatic text categorization is the problem of automatically assigning
text documents to predefined categories. To classify text documents, we
must extract good features from them. In previous research, a text
document is commonly represented by the term frequency and the inverse
document frequency of each feature. Since a document contains both
important and unimportant sentences, features from more important
sentences should be weighted more heavily than other features. In this
paper, we measure the importance of sentences using text summarization
techniques. A document is then represented as a vector of features whose
weights vary according to the importance of each sentence. To verify our
new method, we conducted experiments on newsgroup data sets in two
languages: one written in English and the other in Korean. Four kinds of
classifiers were used in our experiments: Naïve Bayes, Rocchio, k-NN,
and SVM. We observed that our new method yielded a significant
improvement for all classifiers on both data sets.
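The representation described in the abstract can be illustrated with a minimal sketch. It assumes that each sentence already has a non-negative importance score; how the paper derives those scores from text summarization techniques is not covered in this section, so the scores are simply passed in. The function names (`weighted_term_frequencies`, `inverse_document_frequency`, `document_vector`) and the choice to add the sentence score to a term's weight for every occurrence are illustrative assumptions, not the authors' exact formulation.

```python
import math
from collections import Counter


def weighted_term_frequencies(sentences, importance):
    """Sum, for each term, the importance score of every sentence occurrence.

    sentences:  list of sentences, each a list of tokens.
    importance: one non-negative score per sentence (assumed to be given).
    With all scores equal to 1.0 this reduces to plain term frequency.
    """
    if len(sentences) != len(importance):
        raise ValueError("need exactly one importance score per sentence")
    weights = Counter()
    for tokens, score in zip(sentences, importance):
        for term in tokens:
            weights[term] += score
    return dict(weights)


def inverse_document_frequency(documents):
    """Compute log(N / df) per term over a list of tokenized documents.

    documents: list of documents, each a list of sentences (token lists).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update({term for sentence in doc for term in sentence})
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def document_vector(sentences, importance, idf):
    """Combine importance-weighted term frequency with IDF (hypothetical scheme)."""
    wtf = weighted_term_frequencies(sentences, importance)
    # Terms unseen when computing IDF get weight 0.0 in this sketch.
    return {term: w * idf.get(term, 0.0) for term, w in wtf.items()}
```

A term that appears in a highly scored sentence thus contributes more to the document vector than the same term in a low-scored sentence, which is the effect the abstract describes.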
Introduction
The goal of text categorization is to classify documents into a certain
number of pre-defined categories. Text categorization is an active
research area in information retrieval and machine learning. A wide
range of supervised learning algorithms has been applied to this problem
using a training data set of categorized documents. Examples include
Naïve Bayes (McCallum et al., 1998; Ko et al., 2000), Rocchio (Lewis et
al., 1996), Nearest Neighbor (Yang et al., 2002), and Support Vector
Machines (Joachims, 1998).
A text categorization task consists of a training phase and a text
classification phase. The former includes the feature extraction process
and the indexing process. The