Classifying non-Gaussian and Mixed Data Sets
in their Natural Parameter Space

Cécile Levasseur, Uwe F. Mayer, and Ken Kreutz-Delgado

Abstract: We consider the problem of supervised and unsupervised classification of multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). Generalized Linear Statistics (GLS), an important subclass of graphical model techniques, is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains the latent variables to a lower-dimensional parameter subspace. Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of text categorization on categorical data and of mixed-type data classification. As a text document preprocessing tool, we present an extension from binary to categorical data of the feature selection algorithm based on conditional mutual information maximization.

Key words: Generalized Linear Statistics (GLS), exponential family distributions, latent variables, dimensionality reduction, text classification, Reuters-21578.
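
As a rough illustration of what "natural parameter space" means for mixed-type data, the Python sketch below maps the mean parameters of three exponential family components (Bernoulli, Poisson, and unit-variance Gaussian) to their natural parameters and then applies a nearest-centroid rule in that space. It is not the GLS algorithm of the paper: the function names, the choice of distributions, the example values, and the nearest-centroid classifier are all assumptions made for illustration, and the estimation of the lower-dimensional latent subspace is omitted entirely.

```python
import math


def bernoulli_natural(p):
    """Natural parameter (log-odds) of a Bernoulli with mean p, 0 < p < 1."""
    return math.log(p / (1.0 - p))


def poisson_natural(rate):
    """Natural parameter (log-rate) of a Poisson with mean rate > 0."""
    return math.log(rate)


def gaussian_natural(mean):
    """For a unit-variance Gaussian the natural parameter equals the mean."""
    return mean


def to_natural(mean_params):
    """Map a (Bernoulli mean, Poisson rate, Gaussian mean) triple
    to the corresponding vector of natural parameters."""
    p, rate, mean = mean_params
    return (bernoulli_natural(p), poisson_natural(rate), gaussian_natural(mean))


def nearest_centroid(theta, centroids):
    """Return the label whose centroid is closest to theta
    (squared Euclidean distance, measured in natural parameter space)."""
    def sq_dist(label):
        return sum((t - c) ** 2 for t, c in zip(theta, centroids[label]))
    return min(centroids, key=sq_dist)


# Hypothetical class centroids, already expressed as natural parameters.
centroids = {
    "class_a": to_natural((0.8, 4.0, 1.5)),
    "class_b": to_natural((0.2, 0.5, -1.0)),
}

# A hypothetical new point, given by its estimated mean parameters.
theta_new = to_natural((0.9, 5.0, 2.0))
label = nearest_centroid(theta_new, centroids)
```

The design point is only that distances are computed between natural parameters (log-odds, log-rates, means) rather than between raw observations, which puts discrete and continuous components on a common footing.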


You can download a copy of this paper (about 6 pages).

Mayer22.pdf (Portable Document Format, 100 Kbytes)



mayer@math.utah.edu
Fri Jun 12 14:43:35 MDT 2009
Last updated: Thu Mar 17 20:08:57 PDT 2016