Perspectives in Statistical Language Modeling

Sanjeev Khudanpur, Johns Hopkins University, USA

Statistical language models play an indispensable role in the automatic processing of speech and text, particularly in applications such as speech recognition, machine translation and, more recently, information retrieval. We will first view language model estimation as a problem of discrete probability estimation from small samples, and present a solution in which popular practices such as n-gram discounting and back-off emerge as the natural course of action rather than as arbitrary choices. We will then discuss the limitations of n-gram models and provide an overview of some recent advances in language modeling, including methods to (i) incorporate syntactic and semantic dependencies, (ii) dynamically adapt language models to new domains and tasks, (iii) exploit very large text repositories such as the World Wide Web to obtain estimates, (iv) represent words as continuous-valued variables for more effective modeling, and (v) estimate language models via discriminative criteria to suit the end application, e.g., acoustically sensitive language models for speech recognition and word-order-sensitive models for machine translation.
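To make the terms "discounting" and "back-off" concrete, the following is a minimal sketch of a bigram model with absolute discounting and back-off to unigram estimates. It is an illustration only, not the estimation method presented in the talk: the function name train_backoff_bigram, the fixed discount of 0.5, and the toy corpus are all assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(tokens, discount=0.5):
    """Return prob(word, prev) for a discounted bigram model with back-off.

    Hypothetical helper for illustration; the discount value is an assumption.
    """
    unigrams = Counter(tokens)
    total = sum(unigrams.values())

    # followers[prev][word] = number of times `word` directly follows `prev`
    followers = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1

    def prob(word, prev):
        seen = followers.get(prev)
        if not seen:
            # Context never observed: fall back to the unigram estimate.
            return unigrams[word] / total

        context_count = sum(seen.values())
        if word in seen:
            # Discounting: subtract a constant from every observed bigram count.
            return (seen[word] - discount) / context_count

        # Back-off: share the probability mass freed by discounting among the
        # words not seen after `prev`, in proportion to their unigram counts.
        freed_mass = discount * len(seen) / context_count
        unseen_count = total - sum(unigrams[w] for w in seen)
        if unseen_count == 0:
            # Every vocabulary word was seen after `prev`; nothing to back off to.
            return 0.0
        return freed_mass * unigrams[word] / unseen_count

    return prob, sorted(unigrams)


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ate".split()
    prob, vocab = train_backoff_bigram(corpus)
    for w in vocab:
        print(f"P({w} | the) = {prob(w, 'the'):.4f}")
```

In this sketch, a word never observed after a given context still receives non-zero probability, and the conditional distribution over the vocabulary still sums to one, because the mass given to unseen words is exactly the mass removed from seen ones.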