Cécile Levasseur, Uwe F. Mayer, and Ken Kreutz-Delgado
Abstract: We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower-dimensional parameter subspace. Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of text categorization on categorical data and of mixed-type data classification. As a text document preprocessing tool, we present an extension, from binary to categorical data, of the feature selection algorithm based on conditional mutual information maximization.
Key words: Generalized Linear Statistics (GLS), exponential family distributions, latent variables, dimensionality reduction, text classification, Reuters-21578.
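As a rough illustration of the feature selection step mentioned in the abstract, the sketch below applies the generic conditional mutual information maximization (CMIM) criterion to integer-coded categorical features, using plug-in (empirical frequency) estimates. It is not the extension presented in the paper, whose details are not given here. The function names (`mutual_information`, `conditional_mutual_information`, `cmim_select`) and the toy data are hypothetical and chosen only for this example.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two integer-coded arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z)."""
    x = np.asarray(x)
    y = np.asarray(y)
    z = np.asarray(z)
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        cmi += np.mean(mask) * mutual_information(x[mask], y[mask])
    return cmi


def cmim_select(X, y, k):
    """Greedily pick k columns of X by the CMIM criterion.

    score[j] tracks min over already selected s of I(y; X_j | X_s),
    initialised to the unconditional I(y; X_j).
    """
    n_features = X.shape[1]
    score = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    selected = []
    for _ in range(k):
        j_best = int(np.argmax(score))
        selected.append(j_best)
        score[j_best] = -np.inf  # never pick the same feature twice
        for j in range(n_features):
            if np.isfinite(score[j]):
                cond = conditional_mutual_information(
                    X[:, j], y, X[:, j_best]
                )
                score[j] = min(score[j], cond)
    return selected


# Toy data (assumed for illustration): a has 3 levels, b has 2 levels,
# and the class label encodes the pair (a, b).
a = np.tile(np.repeat([0, 1, 2], 2), 4)
b = np.tile([0, 1], 12)
y_demo = 2 * a + b
X_demo = np.column_stack([a, a, b])  # column 1 duplicates column 0

print(cmim_select(X_demo, y_demo, 2))
```

On this toy data the duplicated column carries no information about the label once its twin has been selected, so the criterion skips it in favour of the complementary feature. This redundancy-avoiding behaviour is what distinguishes CMIM from ranking features by plain mutual information.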
You can download a copy of this paper (about 6 pages).