Text Classification from Labeled and Unlabeled
Documents using EM
KAMAL NIGAM†    knigam@cs.cmu.edu
ANDREW KACHITES MCCALLUM‡†    mccallum@justresearch.com
SEBASTIAN THRUN†    thrun@cs.cmu.edu
TOM MITCHELL†    tom.mitchell@cmu.edu
†School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
‡Just Research, 4616 Henry Street, Pittsburgh, PA 15213
Received March 15, 1998; Revised February 20, 1999
Editor: William W. Cohen
Abstract. This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
Keywords: text classification, Expectation-Maximization, integrating supervised and unsupervised learning, combining labeled and unlabeled data, Bayesian learning
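To make the basic EM procedure summarized in the abstract concrete, the following is a minimal illustrative sketch rather than the authors' implementation. It assumes bag-of-words count matrices and integer class labels stored as numpy arrays; all function and variable names are hypothetical, and the paper's two extensions (the weighting factor for unlabeled data and multiple mixture components per class) are not shown.

```python
import numpy as np

def train_naive_bayes(X, label_probs, alpha=1.0):
    # M-step: estimate class priors and per-class word distributions from
    # (possibly fractional) class-membership weights, with Laplace smoothing.
    n_classes = label_probs.shape[1]
    class_counts = label_probs.sum(axis=0)
    log_priors = np.log((class_counts + alpha) / (class_counts.sum() + alpha * n_classes))
    word_counts = label_probs.T @ X                      # expected word counts per class
    log_word_probs = np.log((word_counts + alpha) /
                            (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]))
    return log_priors, log_word_probs

def posterior(X, log_priors, log_word_probs):
    # E-step: P(class | document) under the multinomial naive Bayes model.
    log_joint = X @ log_word_probs.T + log_priors
    log_joint -= log_joint.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(log_joint)
    return probs / probs.sum(axis=1, keepdims=True)

def em_text_classifier(X_labeled, y_labeled, X_unlabeled, n_classes, n_iter=10):
    # Hard (one-hot) labels of the labeled documents stay fixed throughout.
    hard_labels = np.eye(n_classes)[y_labeled]
    log_priors, log_word_probs = train_naive_bayes(X_labeled, hard_labels)
    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # E-step on the unlabeled pool, then M-step on all documents.
        soft_labels = posterior(X_unlabeled, log_priors, log_word_probs)
        all_labels = np.vstack([hard_labels, soft_labels])
        log_priors, log_word_probs = train_naive_bayes(X_all, all_labels)
    return log_priors, log_word_probs
```

In this sketch, convergence is approximated by a fixed number of EM iterations; a fuller treatment would instead monitor the change in the model's log likelihood across iterations.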