Last updates: Wed May 19 09:15:28 2004 Sat Nov 27 11:05:38 2004 Thu Dec 2 15:20:20 2004 Fri Dec 3 13:44:15 2004 Fri Feb 3 16:38:46 2006 Fri Apr 14 07:46:27 2006 Thu Mar 23 14:23:46 2017
The Unicode character set is intended to represent the writing systems of all of the world's major languages. Although early versions could be represented with 16 bits (65,536 characters), by version 2.0 in 1996 that had proved insufficient. The code space now extends from U+0000 to U+10FFFF, which requires 21 bits and provides 1,114,112 code points.
At Unicode version 2.0, there were 38,885 assigned characters. At version 3.0, there were 49,194 assigned characters. At version 3.2, there were 95,156 assigned characters. At version 4.0, there are 96,382 assigned characters.
Variable-width encoding schemes have been developed to reduce the number of bytes required to store Unicode characters. Files containing only 7-bit ASCII characters are byte-for-byte identical when interpreted in the Unicode UTF-8 encoding, so plain ASCII files are already valid Unicode files. With UTF-8, up to four 8-bit bytes may be required to encode any defined Unicode character.
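The variable width of UTF-8 can be seen in a minimal Python sketch. The helper name utf8_length and the sample code points are illustrative choices, not taken from the Unicode Standard; the sketch relies only on Python's built-in UTF-8 codec.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))


# One sample code point from each UTF-8 length class (illustrative choices).
samples = [
    0x0041,   # LATIN CAPITAL LETTER A (7-bit ASCII)
    0x00E9,   # LATIN SMALL LETTER E WITH ACUTE
    0x20AC,   # EURO SIGN
    0x1D11E,  # MUSICAL SYMBOL G CLEF (beyond 16 bits)
]

for cp in samples:
    print(f"U+{cp:04X} needs {utf8_length(cp)} byte(s) in UTF-8")

# A pure-ASCII string has identical bytes in ASCII and in UTF-8.
ascii_text = "plain ASCII text"
ascii_is_utf8 = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII bytes identical under UTF-8:", ascii_is_utf8)
```

The four samples need one, two, three, and four bytes respectively, and the final comparison confirms that ASCII text needs no conversion.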
Some early implementors of Unicode support in programming-language compilers, and the designers of the Java programming language, chose 16-bit representations. With the Unicode UTF-16 encoding, 63,486 characters are each represented by a single 16-bit value, while the remaining 2,048 values (the surrogates, U+D800 through U+DFFF) are used in pairs, a high surrogate followed by a low surrogate, to represent another 1,048,544 characters as a pair of 16-bit values. Since 2048 + 63486 = 65534, which is two less than the 65,536 values representable in 16 bits, there are two remaining 16-bit values: U+FFFE and U+FFFF. They are not used to encode characters, but instead are reserved for internal use (U+FFFF as a sentinel, and U+FFFE for byte-order detection, since it is the byte-swapped form of the byte-order mark U+FEFF). Other compiler implementors store Unicode characters in 32-bit integers (the UTF-32 encoding), allowing a simple correspondence of one Unicode character to one integer.
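The pairing of 16-bit values in UTF-16 can be sketched in a few lines of Python. The function names to_surrogate_pair and from_surrogate_pair are hypothetical helpers written for this illustration; the arithmetic follows the standard UTF-16 surrogate scheme, and the result is cross-checked against Python's built-in utf-16-be codec.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits: 1,024 high surrogates
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits: 1,024 low surrogates
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, an illustrative sample
high, low = to_surrogate_pair(clef)
print(f"U+{clef:X} -> {high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = chr(clef).encode("utf-16-be")
codec_pair = (int.from_bytes(encoded[:2], "big"), int.from_bytes(encoded[2:], "big"))
print("matches built-in codec:", codec_pair == (high, low))
```

The 2,048 surrogate values split into 1,024 high and 1,024 low surrogates, so the pairs cover 1024 × 1024 = 1,048,576 code points; excluding the 32 values of the form U+xFFFE and U+xFFFF in those ranges leaves the 1,048,544 characters cited above.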
The large number of characters in this set naturally poses a severe problem for a font vendor, and also for storage resources on systems that use Unicode. Thus, although the Unicode work has been underway since 1990, font support for Unicode has taken, and will continue to take, years of work, and the available font repertoire is still rather limited. By comparison, tens of thousands of fonts are available for 8-bit character sets: for a sampling, visit this list of font names by vendor. More information about Unicode fonts is given below.
The Unicode Standard is defined in this printed book:
@String{pub-AW     = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
@String{pub-AW:adr = "Reading, MA, USA"}

@Book{Unicode:2003:USV,
  author =          "{The Unicode Consortium}",
  title =           "The Unicode Standard, Version 4.0",
  publisher =       pub-AW,
  address =         pub-AW:adr,
  pages =           "xxxviii + 1462",
  year =            "2003",
  ISBN =            "0-321-18578-1",
  LCCN =            "QA268 .U545 2004",
  bibdate =         "Tue Oct 21 17:47:30 2003",
  note =            "Includes CD-ROM.",
  URL =             "http://www.unicode.org/versions/Unicode4.0.0/",
  acknowledgement = ack-nhfb,
}
Earlier editions of the Unicode Standard were 1.0 (1991/1992), 1.1 (1993/1995), 2.0 (1996), 2.1 (1998), and 3.0 (2000) [consistent with ISO/IEC 10646-1:2000]. The current version is Unicode version 4.0.
The relation between the Unicode and ISO/IEC 10646 Standards is discussed in Unicode and ISO 10646: although the character codes are synchronized, there are still important differences.
The Unicode Consortium maintains a World-Wide Web site at http://www.unicode.org/
An extensive bibliography of publications about Unicode is available at http://www.math.utah.edu/pub/tex/bib/index-table-u.html#unicode
Unicode is used in at least these operating systems:
The 1989 ANSI/ISO Standard C multibyte and wide character data types can also offer limited support for Unicode. However, most conventional programming languages are not equipped to deal with Unicode characters because they have deeply ingrained assumptions about the storage size of characters.
The Omega typesetting system developed by Yannis Haralambous and John Plaice is an extension of the widely used TeX typesetting system to support and use Unicode.
Most Unix operating system vendors have begun development work to support Unicode in future releases.
Limited Unicode font support is available from:
More information on Unicode fonts can be found at http://www.truetype.demon.co.uk/unicode.htm
The new OpenType font specification, jointly developed by Adobe and Microsoft, provides for future support of Unicode. OpenType is based on a merger of Adobe Type 1 and Apple/Microsoft TrueType font formats, and OpenType systems will support those older fonts as well. Plans for OpenType support have been announced by several major font and operating system vendors.
Roman Czyborra has developed prototype Unicode fonts for the X Window System: see http://czyborra.com/unifont/ for details.
Frank da Cruz maintains useful character set tables in the Kermit Project for verifying correct display of fonts: http://www.columbia.edu/kermit/csettables.html
Markus Kuhn has developed prototype ISO10646-1 (Unicode) fonts for the X Window System; see http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html for details. He has also prepared a comprehensive tutorial on UTF-8 and Unicode: see http://www.cl.cam.ac.uk/~mgk25/unicode.html.
Microsoft maintains a comprehensive Web site on Unicode-related issues at http://www.microsoft.com/globaldev/
There is an open source initiative to develop C/C++ software for Unicode support: International Components for Unicode (ICU): http://oss.software.ibm.com/icu/
Interview with font designer Victor Gaultney on the design of the Gentium font for Unicode.
James Kass maintains a Web site with pointers to Unicode tables and other resources at http://home.att.net/~jameskass/
OpenI18N WG of the Free Standards Group Common Locale Data Repository V1.0
The Script Encoding Initiative at the Department of Linguistics, University of California, Berkeley http://www.linguistics.berkeley.edu/~dwanders/ works on encoding of minority scripts for eventual inclusion in Unicode.
A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX: http://eyegene.ophthy.med.umich.edu/unicode/
Finally, I maintain an extensive, and frequently updated, bibliography of publications about Unicode.