Bootstrapped Language Identification For Multi-Site Internet Domains

Uwe F. Mayer

Abstract: We present an algorithm for language identification, in particular of short documents, for the case of an Internet domain with sites in multiple countries with differing languages. The algorithm is significantly faster than standard language identification methods while providing state-of-the-art identification. We bootstrap the algorithm from labels obtained by identifying the language from the site alone, a methodology suitable for any supervised language identification algorithm. We demonstrate the bootstrapping and the algorithm on eBay email data and on Twitter status update data. The algorithm is deployed at eBay as part of the back-office development data repository.

Key words: Language identification, large data, statistical model, boosting.
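
The bootstrapping idea summarized in the abstract can be illustrated with a minimal sketch. This is not the algorithm of the article: a plain character-trigram naive Bayes classifier is swapped in as a stand-in for "any supervised language identification algorithm", and the site names, the site-to-language table, the sample texts, and all function names are hypothetical. Documents are first labeled with the language of the site they came from, a model is trained on those noisy labels, and the model then relabels its own training data for a few rounds.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the padded, lowercased text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(labelled_docs, n=3):
    """Count character n-grams per language from (text, language) pairs."""
    counts = defaultdict(Counter)
    for text, lang in labelled_docs:
        counts[lang].update(char_ngrams(text, n))
    return counts


def classify(counts, text, n=3):
    """Return the language with the highest add-one-smoothed log-likelihood."""
    vocab = set()
    for lang_counts in counts.values():
        vocab.update(lang_counts)
    v = len(vocab) + 1  # one extra slot for unseen n-grams
    grams = char_ngrams(text, n)
    best_lang, best_score = None, -math.inf
    for lang in sorted(counts):  # sorted for deterministic tie-breaking
        total = sum(counts[lang].values())
        score = sum(math.log((counts[lang][g] + 1) / (total + v)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


def bootstrap(site_docs, site_language, rounds=2, n=3):
    """Train on site-derived labels, then relabel and retrain for a few rounds.

    site_docs: list of (text, site) pairs.
    site_language: dict mapping each site to its assumed main language.
    """
    # Round 0: every document inherits the language of its site (noisy labels).
    labelled = [(text, site_language[site]) for text, site in site_docs]
    counts = train(labelled, n)
    for _ in range(rounds):
        # Let the current model correct the labels, then retrain on them.
        labelled = [(text, classify(counts, text, n)) for text, _ in labelled]
        counts = train(labelled, n)
    return counts


# Hypothetical sample data: the last document is English on the German site.
site_language = {"site_us": "en", "site_de": "de"}
site_docs = [
    ("the item has been shipped to your address", "site_us"),
    ("thank you for your payment and your order", "site_us"),
    ("please leave feedback for the seller", "site_us"),
    ("der artikel wurde an ihre adresse versandt", "site_de"),
    ("vielen dank für ihre zahlung und ihre bestellung", "site_de"),
    ("bitte geben sie eine bewertung für den verkäufer ab", "site_de"),
    ("thank you for your order", "site_de"),
]
model = bootstrap(site_docs, site_language)
print(classify(model, "thank you for your order"))
print(classify(model, "vielen dank für ihre bestellung"))
```

The relabeling loop is what lets a document written in English but sent from the German site move to the English training set; how the article's own algorithm handles such cases is described in the article itself.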


You can download a copy of this article (about 7 pages including references).

Mayer30.pdf (Portable Document Format, 112 Kbytes)



mayer@math.utah.edu
First posted: Sat May 19 11:43:22 PDT 2012
Last updated: Fri Jun 1 17:16:07 PDT 2012