JADT 2008 : 9es Journées internationales d’Analyse statistique des Données Textuelles
Arabic statistical language modeling
Karima Meftouh1, Kamel Smaili2, Mohamed-Tayeb Laskri1
1Badji Mokhtar University – Computer Science Department – BP 12 23000 Annaba – Algeria
2INRIA-LORIA – Parole team – BP 101 54602 Villers Les Nancy – France
In this study we investigate statistical language models for Arabic. Several experiments using
different smoothing techniques have been carried out on a small corpus extracted from a daily newspaper. The
sparseness of the data leads us to investigate other solutions that do not require increasing the size of the
corpus. A word segmentation technique has been employed in order to increase the statistical reliability of the
corpus. This leads to better performance in terms of normalized perplexity.
Keywords: Arabic language, statistical language model, text corpora, perplexity, segmentation.
A statistical language model is used to build up sequences of words, classes or phrases that
are linguistically valid without any use of external knowledge. A list of probabilities is
estimated from a large corpus to indicate the likelihood of linguistic events, where an event is
any potential succession of words. This kind of model is useful in a large variety of research
areas (Goodman, 2001): speech recognition, optical character recognition, machine
translation, spelling correction, and so on. The most common model in the literature is the
well-known n-gram model, in which the probability of a word is estimated from the n-1
previous words. To be effective, these models need a huge amount of data to train all the
required parameters.
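The estimation described above can be illustrated with a minimal sketch in Python: maximum-likelihood bigram probabilities counted from a toy whitespace-tokenized corpus, plus the perplexity measure used throughout the paper. The function names and the toy corpus are illustrative, not part of the original study, and the perplexity function simply skips unseen bigrams where a real model would apply one of the smoothing techniques discussed later.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Maximum-likelihood bigram estimates P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigram_counts[w1]
            for (w1, w2), c in bigram_counts.items()}

def perplexity(tokens, probs):
    """Perplexity of a token sequence under the bigram model.

    Unseen bigrams are skipped for simplicity; a real model would
    smooth their probabilities instead.
    """
    log_sum, n = 0.0, 0
    for pair in zip(tokens, tokens[1:]):
        if pair in probs:
            log_sum += math.log2(probs[pair])
            n += 1
    return 2 ** (-log_sum / n)

# Toy corpus (illustrative only)
corpus = "the cat sat on the mat the cat ran".split()
probs = train_bigram(corpus)
print(probs[("the", "cat")])   # 2 of the 3 occurrences of "the" precede "cat"
print(perplexity(corpus, probs))
```

On a realistic corpus the vocabulary is large and most bigrams never occur in training, which is exactly the data-sparseness problem the smoothing experiments below address.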
For Arabic, the available resources are not as abundant as those for the Indo-
European languages, owing to the relatively recent interest in Arabic applications. In this
paper, we investigate several classical statistical language models in order to study their
pertinence for the Arabic language. Data sparseness leads us to test s