Fast Text Compression with Neural Networks
Matthew V. Mahoney
Florida Institute of Technology
150 W. University Blvd.
Melbourne FL 32901
mmahoney@cs.fit.edu
Abstract
Neural networks have the potential to extend data compression algorithms beyond the character-level n-gram models now in use, but have usually been avoided because they are too slow to be practical. We introduce a model that produces better compression than popular Lempel-Ziv compressors (zip, gzip, compress), and is competitive in time, space, and compression ratio with PPM and Burrows-Wheeler algorithms, currently the best known. The compressor, a bit-level predictive arithmetic encoder using a 2-layer, 4 × 10^6 by 1 network, is fast (about 10^4 characters/second) because only 4-5 connections are simultaneously active and because it uses a variable learning rate optimized for one-pass training.
Keywords: Neural networks, Text compression, Data compression, On-line training/learning, Maximum entropy, Prediction, Efficiency.
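As a rough, non-authoritative illustration of the kind of predictor the abstract describes, the Python sketch below implements a single output unit over a large, sparsely activated input table, trained online with a decaying per-weight learning rate. The hash function, table size, and rate schedule here are placeholder assumptions, not the design given later in this paper.

    import math

    # Minimal sketch (not the paper's implementation): a bit-level predictor
    # with one output unit over a large, sparsely activated input layer.
    N = 1 << 22          # roughly 4 x 10^6 input units (illustrative size)
    w = [0.0] * N        # weights of the single output unit
    cnt = [0] * N        # per-weight update counts, for a variable learning rate

    def active_features(context_bytes, partial_bits):
        """Hash the last 1..5 bytes plus the bits of the current byte seen so
        far into a few input indices (illustrative hash, not the paper's)."""
        feats = []
        for order in range(1, 6):
            h = hash((order, tuple(context_bytes[-order:]), tuple(partial_bits)))
            feats.append(h % N)
        return feats

    def predict(feats):
        """P(next bit = 1): squashed sum of the few active weights."""
        x = sum(w[f] for f in feats)
        return 1.0 / (1.0 + math.exp(-x))

    def update(feats, p, bit):
        """One-pass online update; the per-weight rate decays with use, a
        stand-in for an optimized variable learning rate."""
        err = bit - p
        for f in feats:
            cnt[f] += 1
            w[f] += (0.5 / cnt[f]) * err

    def code_length(p, bit):
        """Ideal arithmetic-coding cost of 'bit' under prediction p, in bits."""
        return -math.log2(p if bit else 1.0 - p)

An arithmetic coder would consume predict() for every bit of the input and then call update(); code_length() gives the ideal cost of each coded bit, so total compressed size is the sum of these costs.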
1. Introduction
One of the motivations for using neural networks for data compression is that they excel in complex pattern recognition. Standard compression algorithms, such as Lempel-Ziv, PPM (Bell, Witten, and Cleary, 1989), or Burrows-Wheeler (Burrows and Wheeler, 1994), are based on simple n-gram models: they exploit the nonuniform distribution of text sequences found in most data. For example, the character trigram "the" is more common than "qzv" in English text, so the former would be assigned a shorter code.
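(As a quick illustration, with made-up probabilities: an ideal entropy coder spends about -log2 p bits on an event of probability p, so the more probable trigram gets the shorter code.)

    import math

    # Illustrative only: ideal code length is -log2(p) bits for an event of
    # probability p.  The probabilities below are placeholders, not measured.
    for trigram, p in [("the", 1e-2), ("qzv", 1e-9)]:
        print(f"{trigram}: {-math.log2(p):.1f} bits")   # about 6.6 vs 29.9 bits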
However, there are other types of learnable redundancies that cannot be modeled using n-gram frequencies. For example, Rosenfeld (1996) combined word trigrams with semantic associations, such as “fire...heat”, where certain pairs of words are likely to occur near each other but the intervening text may vary, to achieve an unsurpassed word perplexity of 68, or about 1.23 bits per character (bpc), on the 38 million word Wall Street Journal