Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium and designed to support the use of text written in all of the world's major writing systems. Version 15.1 of the standard defines 149,813 characters and 161 scripts used in various ordinary, literary, academic, and technical contexts. Many common characters, including numerals, punctuation, and other symbols, are unified within the standard and are not treated as specific to any given writing system. Unicode encodes thousands of emoji, whose continued development is conducted by the Consortium as part of the standard. Moreover, the widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan. Unicode is ultimately capable of encoding more than 1.1 million characters.
Unicode has largely supplanted the previous environment of myriad incompatible character sets, each used within different locales and on different computer architectures. Unicode is used to encode the vast majority of text on the Internet, including most web pages, and relevant Unicode support has become a common consideration in contemporary software development.
The Unicode character repertoire is synchronized with ISO/IEC 10646, each being code-for-code identical with one another. However, The Unicode Standard is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes explaining concepts germane to various scripts, providing guidance for their implementation. Topics covered by these annexes include character normalization, character composition and decomposition, collation, and directionality.
Unicode text is processed and stored as binary data using one of several encodings, which define how to translate the standard's abstracted codes for characters into sequences of bytes. The Unicode Standard itself defines three encodings: UTF-8, UTF-16, and UTF-32, though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with ASCII.
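As a rough illustration of how the three encoding forms differ, the following Python sketch encodes a few sample characters and reports how many bytes each form needs. The sample characters and the helper name `encoded_lengths` are chosen here for illustration and do not come from the standard; only Python's built-in codecs are used.

```python
# Sample characters chosen for illustration:
# ASCII, Latin-1 range, elsewhere in the BMP, and a supplementary-plane character.
samples = ["A", "\u00E9", "\u20AC", "\U0001F600"]


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes needed for ch in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # An explicit byte order is given so that no byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_compatible = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")
```

UTF-8 grows from one to four bytes across these samples, UTF-16 uses two or four, and UTF-32 always uses four.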
The philosophy that underpins Unicode seeks to encode the underlying characters (graphemes and grapheme-like units) rather than graphical distinctions considered mere variant glyphs thereof, which are instead best handled by the typeface, through the use of markup, or by some other means. In particularly complex cases, such as Han unification, there is considerable disagreement regarding which differences justify their own encodings, and which are only graphical variants of other characters.
At the most abstract level, Unicode assigns a unique number called a code point to each character. Many issues of visual representation (including size, shape, and style) are intended to be left to the discretion of the software actually rendering the text, such as a web browser or word processor. However, partially with the intent of encouraging rapid adoption, this originally simple model has become somewhat more elaborate over time, and various pragmatic concessions have been made over the course of the standard's development.
The first 256 code points mirror the ISO/IEC 8859-1 standard, with the intent of trivializing the conversion of text already written in Western European scripts. To preserve the distinctions made by different legacy encodings, therefore allowing for conversion between them and Unicode without any loss of information, many characters nearly identical to others, in both appearance and intended function, were given distinct code points. For example, the Halfwidth and Fullwidth Forms block encompasses a full semantic duplicate of the Latin alphabet, because legacy CJK encodings contained both "fullwidth" (matching the width of CJK characters) and "halfwidth" (matching ordinary Latin script) characters.
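Both points can be checked with a short Python sketch using the standard `unicodedata` module. The choice of U+FF21 as the sample fullwidth letter is an illustrative assumption; the results reflect whichever Unicode version the Python build ships.

```python
import unicodedata

# The first 256 code points coincide with ISO/IEC 8859-1 (Latin-1) byte values.
latin1_mirrors_unicode = all(
    ord(bytes([b]).decode("latin-1")) == b for b in range(256)
)

# U+FF21 is the fullwidth duplicate of U+0041 LATIN CAPITAL LETTER A.
fullwidth_a = "\uFF21"
name = unicodedata.name(fullwidth_a)

# Compatibility normalization (NFKC) folds the fullwidth form to the ordinary
# letter; canonical normalization (NFC) keeps the two characters distinct.
nfkc = unicodedata.normalize("NFKC", fullwidth_a)
nfc = unicodedata.normalize("NFC", fullwidth_a)
```

Because the duplicate has its own code point, a round trip through a legacy CJK encoding loses nothing, while software that wants to treat the two forms alike can opt in to compatibility normalization.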
The Unicode Bulldog Award is given to people deemed to be influential in Unicode's development, with recipients including Tatsuo Kobayashi, Thomas Milo, Roozbeh Pournader, Ken Lunde, and Michael Everson.
In this document, entitled Unicode 88, Becker outlined a scheme using 16-bit characters:
Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
This design decision was made based on the assumption that only scripts and characters in 'modern' use would require encoding:
Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in the modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2¹⁴ = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicode.
In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of Research Libraries Group, and Glenn Wright of Sun Microsystems. In 1990, Michel Suignard and Asmus Freytag of Microsoft and NeXT's Rick McGowan had also joined the group. By the end of 1990, most of the work of remapping existing standards had been completed, and a final review draft of Unicode was ready.
The Unicode Consortium was incorporated in California on 3 January 1991, and the first volume of The Unicode Standard was published that October. The second volume, now adding Han ideographs, was published in June 1992.
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts, such as Egyptian hieroglyphs, and thousands of rarely used or obsolete characters that had not been anticipated for inclusion in the standard. Among these characters are various rarely used CJK characters—many mainly being used in proper names, making them far more necessary for a universal encoding than the original Unicode architecture envisioned.
Version 1.0 of Microsoft's TrueType specification, published in 1992, used the name 'Apple Unicode' instead of 'Unicode' for the Platform ID in the naming table.
Over the years several countries or government agencies have been members of the Unicode Consortium. Presently only the Ministry of Endowments and Religious Affairs (Oman) is a full member with voting rights.
The Consortium has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
A total of 161 scripts are included in the latest version of Unicode (covering alphabets, abugidas, and syllabaries), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and musical notation (in the form of notes and rhythmic symbols), also occur.
The Unicode Roadmap Committee (Michael Everson, Rick McGowan, Ken Whistler, V.S. Umamaheswaran) maintains the list of scripts that are candidates or potential candidates for encoding, together with their tentative code block assignments, on the Unicode Roadmap page of the Unicode Consortium website. For some scripts on the Roadmap, such as Jurchen script and Khitan large script, encoding proposals have been made and are working their way through the approval process. For other scripts, such as Maya script (besides numbers) and Rongorongo, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Some modern invented scripts which have not yet been included in Unicode (e.g., Tengwar) or which do not qualify for inclusion in Unicode due to lack of real-world use (e.g., Klingon scripts) are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Areas code assignments.
There is also a Medieval Unicode Font Initiative focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.
While the UCS is a simple character map, Unicode specifies the rules, algorithms, and properties necessary to achieve interoperability between different platforms and languages. Thus, The Unicode Standard includes more information, covering in-depth topics such as bitwise encoding, collation, and rendering. It also provides a comprehensive catalog of character properties, including those needed for supporting bidirectional text, as well as visual charts and reference data sets to aid implementers. Previously, The Unicode Standard was sold as a print volume containing the complete core specification, standard annexes, and code charts. However, Unicode 5.0, published in 2006, was the last version printed this way. Starting with version 5.2, only the core specification, published as a print-on-demand paperback, may be purchased. The full text, on the other hand, is published as a free PDF on the Unicode website.
A practical reason for this publication method highlights the second significant difference between the UCS and Unicode—the frequency with which updated versions are released and new characters added. The Unicode Standard has regularly released annual expanded versions, occasionally with more than one version released in a calendar year and with rare cases where the scheduled release had to be postponed. For instance, in April 2020, a month after version 13.0 was published, the Unicode Consortium announced they had changed the intended release date for version 14.0, pushing it back six months from March 2021 to September 2021 due to the COVID-19 pandemic.
The latest version of Unicode, 15.1, was released on 12 September 2023. It is a minor version update to version 15.0 that was released on 13 September 2022. 15.0 added a total of 4,489 new characters, including two new scripts, an extension to the CJK Unified Ideographs block, and multiple additions to existing blocks. 33 new emoji were added, such as the "wireless" (network) symbol and additional colored hearts.
Thus far, the following versions of The Unicode Standard have been published. Update versions, which do not include any changes to character repertoire, are signified by the third number (e.g., "version 4.0.1") and are omitted in the table below.
Unicode version history and notable changes to characters and scripts
Version | Corresponding UCS edition | Scripts | Characters | Notable additions
1.0.0 (Vol. 1) | | 24 | 7,129 | Initial repertoire covers these scripts: Arabic script, Armenian, Bengali alphabet, Bopomofo, Cyrillic script, Devanagari, Georgian, Greek alphabet, Gujarati script, Gurmukhi script, Hangul, Hebrew alphabet, Hiragana, Kannada script, Katakana, Lao script, Latin script, Malayalam script, Odia script, Tamil script, Telugu script, Thai script, and Tibetan script.
1.0.1 (Vol. 2) | | 25 | 28,327 (21,204 added; 6 removed) | The initial set of 20,902 CJK Unified Ideographs is defined.
1.1 | ISO/IEC 10646-1:1993 | 24 | 34,168 (5,963 added; 89 removed; 33 reclassified as control characters) | 4,306 more Hangul syllables; Tibetan script removed
2.0 | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 | 25 | 38,885 (11,373 added; 6,656 removed) | Original set of Hangul syllables removed; new set of 11,172 Hangul syllables added at new location; Tibetan script added back in a new location and with a different character repertoire; surrogate character mechanism defined; Plane 15 and Plane 16 Private Use Areas allocated
2.1 | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, as well as two characters from Amendment 18 | 25 | 38,887 (2 added) | Euro sign, Object Replacement Character
3.0 | ISO/IEC 10646-1:2000 | 38 | 49,194 (10,307 added) | Cherokee, Ethiopic, Khmer script, Mongolian script, Burmese script, Ogham, Runes, Sinhala script, Syriac alphabet, Thaana, Unified Canadian Aboriginal Syllabics, and Yi script; Braille patterns
3.1 | ISO/IEC 10646-1:2000, ISO/IEC 10646-2:2001 | 41 | 94,140 (44,946 added) | Deseret alphabet, Gothic alphabet, and Old Italic; sets of symbols for Western music and Byzantine music; 42,711 additional CJK Unified Ideographs
3.2 | ISO/IEC 10646-1:2000 plus Amendment 1, ISO/IEC 10646-2:2001 | 45 | 95,156 (1,016 added) | Philippine scripts (Buhid script, Hanunoo script, Baybayin, and Tagbanwa script)
4.0 | ISO/IEC 10646:2003 | 52 | 96,382 (1,226 added) | Cypriot syllabary, Limbu script, Linear B, Osmanya script, Shavian alphabet, Tai Le, and Ugaritic; hexagram symbols; "ș" and "ț" characters to support Romanian
4.1 | ISO/IEC 10646:2003 plus Amendment 1 | 59 | 97,655 (1,273 added) | Lontara script, Glagolitic, Kharosthi, New Tai Lue, Old Persian, Sylheti Nagri, and Tifinagh; Coptic alphabet disunified from Greek alphabet; ancient Greek numbers and musical symbols; first named character sequences introduced
5.0 | ISO/IEC 10646:2003 plus Amendments 1 and 2, as well as four characters from Amendment 3 | 64 | 99,024 (1,369 added) | Balinese script, Cuneiform, N'Ko, ʼPhags-pa, Phoenician
5.1 | ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 | 75 | 100,648 (1,624 added) | Carian alphabets, Cham script, Kayah Li, Lepcha script, Lycian script, Lydian script, Ol Chiki, Rejang alphabet, Saurashtra, Sundanese script, and Vai syllabary; sets of symbols for the Phaistos Disc, Mahjong, and Dominoes; additions to Burmese script; scribal abbreviations; addition of capital ẞ
5.2 | ISO/IEC 10646:2003 plus Amendments 1, 2, 3, 4, 5 and 6 | 90 | 107,296 (6,648 added) | Avestan alphabet, Bamum script, Gardiner Set of Egyptian hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese script, Kaithi, Fraser script, Meitei script, Old South Arabian, Old Turkic, Samaritan script, Tai Tham, and Tai Viet; additional CJK Unified Ideographs; Jamo for Hangul; Vedic Sanskrit
6.0 | ISO/IEC 10646:2010 plus the Indian rupee sign | 93 | 109,384 (2,088 added) | Batak script, Brahmi script, Mandaic alphabet; playing card symbols, traffic sign and map symbols, alchemical symbols, emoticons; additional CJK Unified Ideographs
6.1 | ISO/IEC 10646:2012 | 100 | 110,116 (732 added) | Chakma script, Meroitic script (cursive and hieroglyphic), Pollard script, Sharada script, Sora Sompeng, and Takri script
6.2 | ISO/IEC 10646:2012 plus the Turkish lira sign | 100 | 110,117 (1 added) | Turkish lira sign
6.3 | ISO/IEC 10646:2012 plus six characters | 100 | 110,122 (5 added) | 5 bidirectional formatting characters
7.0 | ISO/IEC 10646:2012 plus Amendments 1 and 2, as well as the Ruble sign | 123 | 112,956 (2,834 added) | Bassa Vah, Caucasian Albanian, Duployan, Elbasan script, Grantha script, Khojki script, Khudabadi script, Linear A, Mahajani, Manichaean, Mende Kikakui, Modi script, Mro script, Nabataean script, Old North Arabian, Old Permic, Pahawh Hmong, Palmyrene, Pau Cin Hau, Psalter Pahlavi, Siddham, Tirhuta script, Warang Citi, and
8.0 | ISO/IEC 10646:2014 plus Amendment 1, as well as the Georgian lari, nine CJK unified ideographs, and 41 emoji | 129 | 120,672 (7,716 added) | Ahom script, Anatolian hieroglyphs, Hatran alphabet, Multani script, Old Hungarian, SignWriting; additional CJK Unified Ideographs; lowercase letters for Cherokee; 5 emoji skin tone modifiers
9.0 | ISO/IEC 10646:2014 plus Amendments 1 and 2, as well as Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols | 135 | 128,172 (7,500 added) | Adlam script, Bhaiksuki script, Marchen, Pracalit script, Osage script, Tangut script; 72 emoji
10.0 | ISO/IEC 10646:2017 plus 56 emoji characters, 285 hentaigana characters, and 3 Zanabazar Square characters | 139 | 136,690 (8,518 added) | Zanabazar Square, Soyombo script, Masaram Gondi, Nüshu; hentaigana; 7,494 CJK Unified Ideographs; 56 emoji; bitcoin symbol
11.0 | ISO/IEC 10646:2017 plus Amendment 1, as well as 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters | 146 | 137,374 (684 added) | Dogri script, Georgian Mtavruli capital letters, Gunjala Gondi, Hanifi Rohingya, Indic Siyaq Numbers, Makasar, Medefaidrin, Sogdian alphabet, Maya numerals; 5 CJK Unified Ideographs; symbols for xiangqi and star ratings; 145 emoji
12.0 | ISO/IEC 10646:2017 plus Amendments 1 and 2, as well as 62 additional characters | 150 | 137,928 (554 added) | Elymaic, Nandinagari, Nyiakeng Puachue Hmong, Wancho script; Pollard script; hiragana and katakana small letters; Tamil script historic fractions and symbols; Lao script letters for Pali; Latin letters for Egyptological and Ugaritic transliteration; hieroglyph format controls; 61 emoji
12.1 | | 150 | 137,929 (1 added) | A single character at U+32FF for the square ligature form of the name of the Reiwa era
13.0 | ISO/IEC 10646:2020 | 154 | 143,859 (5,930 added) | Chorasmian, Dhives Akuru, Khitan small script, Yezidi; 4,969 CJK ideographs (including 4,939 in Ext. G); Arabic script additions used to write Hausa language, Wolof language, and other African languages; additions used to write Hindko and Punjabi language in Pakistan; Bopomofo additions used for Cantonese; Creative Commons license symbols; graphic characters for compatibility with teletext and home computer systems; 55 emoji
14.0 | | 159 | 144,697 (838 added) | Toto language, Cypro-Minoan, Vithkuqi script, Old Uyghur, Tangsa language; extended IPA; Arabic script additions for use in languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java, and Bosnia; additions for honorifics and Quranic use; additions to support languages in North America, the Philippines, India, and Mongolia; addition of the Kyrgyzstani som currency symbol; Znamenny chant musical notation; 37 emoji
15.0 | | 161 | 149,186 (4,489 added) | Kawi script and Mundari Bani; 20 emoji; 4,192 CJK ideographs; control characters for Egyptian hieroglyphs
15.1 | | 161 | 149,813 (627 added) | Additional CJK ideographs
In this normative notation, the two-character prefix U+ always precedes a written code point, and the code points themselves are written as hexadecimal numbers. At least four hexadecimal digits are always written, with leading zeros prepended as needed.
For example, the code point U+00F7 (DIVISION SIGN) is padded with two leading zeros, but U+13254 (EGYPTIAN HIEROGLYPH O004) is not padded.
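The padding rule is easy to state in code. The following Python sketch uses two hypothetical helpers, `u_plus` and `parse_u_plus`, introduced here only for illustration, to convert between a character and its U+ notation.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+ notation."""
    # ":04X" pads with leading zeros to at least four uppercase hex digits;
    # longer values are written in full without extra padding.
    return f"U+{ord(ch):04X}"


def parse_u_plus(text: str) -> str:
    """Return the character named by a string such as 'U+00F7'."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    return chr(int(text[2:], 16))
```

With these helpers, `u_plus("\u00F7")` yields a padded four-digit form, while a supplementary-plane character such as `"\U00013254"` is written with all five of its digits.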
There are a total of (17 × 2¹⁶) − 2¹¹ = 1,112,064 valid code points within the codespace. Even though 4 bytes are used when encoding many characters using UTF-8, at first blush potentially allowing for code points up to U+1FFFFF, limitations arising from UTF-16's use of surrogate pairs mean that the 2¹¹ = 2,048 code points in the range U+D800–U+DFFF, as well as all code points U+110000 and above, are permanently unusable for encoding characters.
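The arithmetic behind these figures can be reproduced in a few lines of Python. The constant names below are illustrative; `sys.maxunicode` is the largest code point Python itself accepts.

```python
import sys

PLANES = 17              # planes 0 through 16
PLANE_SIZE = 2**16       # code points per plane
SURROGATES = 2**11       # U+D800 through U+DFFF

codespace_size = PLANES * PLANE_SIZE               # U+0000 through U+10FFFF
valid_code_points = codespace_size - SURROGATES    # excludes the surrogate range

# A 4-byte UTF-8 sequence carries 21 payload bits, more than the codespace needs.
four_byte_utf8_limit = 2**21 - 1

# Python enforces the same upper bound on its own strings.
upper_bound_matches = sys.maxunicode == codespace_size - 1
```

Here `codespace_size` is 1,114,112 and `valid_code_points` is 1,112,064, while `four_byte_utf8_limit` is 0x1FFFFF, well above the actual ceiling of 0x10FFFF.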
Within each plane, characters are allocated within named blocks of related characters. The size of a block is always a multiple of 16, and is often a multiple of 128, but is otherwise arbitrary. Characters required for a given script may be spread out over several different, potentially disjunct blocks within the codespace.
The 1,024 code points in the range U+D800–U+DBFF are known as high-surrogate code points, and code points in the range U+DC00–U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point followed by a low-surrogate code point forms a surrogate pair in UTF-16 in order to represent code points greater than U+FFFF. In principle, these code points cannot otherwise be used, though in practice this rule is often ignored, especially when not using UTF-16.
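The mapping between a supplementary code point and its surrogate pair is simple bit arithmetic. The sketch below shows one way to write it in Python; the function names are hypothetical, and Python's own UTF-16 codec is used only as a cross-check.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Cross-check against Python's UTF-16 encoder for U+1F600.
high, low = to_surrogate_pair(0x1F600)
codec_bytes = "\U0001F600".encode("utf-16-be")
matches_codec = codec_bytes == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Ten bits in each half give 1,024 × 1,024 = 1,048,576 combinations, exactly the number of code points in the 16 supplementary planes.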
A small set of code points are guaranteed never to be assigned to characters, although third parties may make independent use of them at their discretion. There are 66 of these noncharacters: U+FDD0–U+FDEF and the last two code points of each plane, that is, any code point whose hexadecimal representation ends in FFFE or FFFF (e.g. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined. Like surrogates, the rule that these cannot be used is often ignored, although the operation of the byte order mark assumes that U+FFFE will never be the first code point in a text. The exclusion of surrogates and noncharacters leaves 1,111,998 code points available for use.
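The count of 66 follows directly from the definition: 32 code points in the contiguous range plus 2 at the end of each of the 17 planes. A minimal Python sketch, with the hypothetical predicate `is_noncharacter`, confirms the totals by brute force.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of any plane."""
    # (cp & 0xFFFE) == 0xFFFE matches values whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


CODESPACE = 0x110000   # number of code points, U+0000 through U+10FFFF
SURROGATES = 0x800     # U+D800 through U+DFFF

noncharacter_count = sum(is_noncharacter(cp) for cp in range(CODESPACE))
available = CODESPACE - SURROGATES - noncharacter_count
```

Running this gives a `noncharacter_count` of 66 and leaves 1,111,998 code points in `available`.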
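Private-use code points are considered to be assigned, but they intentionally have no interpretation specified by The Unicode Standard, so any interchange of such code points requires an independent agreement between the sender and receiver as to their interpretation. There are three private-use areas in the Unicode codespace: the Private Use Area (U+E000–U+F8FF, 6,400 code points), Supplementary Private Use Area-A (U+F0000–U+FFFFD, 65,534 code points), and Supplementary Private Use Area-B (U+100000–U+10FFFD, 65,534 code points).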
Graphic characters are those defined by The Unicode Standard to have particular semantics, either having a visible glyph shape or representing a visible space. As of Unicode 15.1, there are 149,641 graphic characters.
Format characters are characters that do not have a visible appearance but may have an effect on the appearance or behavior of neighboring characters. For example, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER may be used to change the default shaping behavior of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 172 format characters in Unicode 15.1.
65 code points, the ranges U+0000–U+001F and U+007F–U+009F, are reserved as control codes, corresponding to the C0 and C1 control codes as defined in ISO/IEC 6429. Of these, U+0009 CHARACTER TABULATION, U+000A LINE FEED, and U+000D CARRIAGE RETURN are widely used in texts using Unicode. In a phenomenon known as mojibake, the C1 code points are improperly decoded according to the Windows-1252 codepage, previously widely used in Western European contexts.
Together, graphic, format, control code, and private-use characters are referred to as assigned characters. Reserved code points are those code points that are valid and available for use, but have not yet been assigned. As of Unicode 15.1, there are 824,652 reserved code points.
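These basic types can be approximated from the General_Category property, which Python exposes through `unicodedata.category`. The sketch below is an assumption-laden illustration: the helper `coarse_type` and its mapping from category codes to type names are introduced here, and the data reflects whichever Unicode version the Python build ships, so the graphic, format, and unassigned counts may differ from the Unicode 15.1 figures.

```python
import unicodedata
from collections import Counter

# Assumed mapping from General_Category codes to the coarse types in the text.
# Zl and Zp (line and paragraph separator) are counted as format characters;
# "unassigned" (Cn) covers both reserved code points and noncharacters.
CATEGORY_TO_TYPE = {
    "Cc": "control",
    "Cf": "format",
    "Zl": "format",
    "Zp": "format",
    "Co": "private use",
    "Cs": "surrogate",
    "Cn": "unassigned",
}


def coarse_type(cp: int) -> str:
    """Classify a code point; anything not listed above is treated as graphic."""
    return CATEGORY_TO_TYPE.get(unicodedata.category(chr(cp)), "graphic")


# Tally the whole codespace, U+0000 through U+10FFFF.
counts = Counter(coarse_type(cp) for cp in range(0x110000))
```

The control, surrogate, and private-use totals are fixed by the standard's stability policies, so they should come out as 65, 2,048, and 137,468 on any Python version.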
All assigned characters have a unique and immutable name by which they are identified. This immutability has been guaranteed since version 2.0 of The Unicode Standard by its Name Stability policy. In cases where a name is seriously defective and misleading, or has a serious typographical error, a formal alias may be defined that applications are encouraged to use in place of the official character name. For example, U+A015 YI SYLLABLE WU has the formal alias YI SYLLABLE ITERATION MARK, and U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (sic) has the formal alias PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
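Name stability and formal aliases can be observed from Python, whose `unicodedata` module reports official names and, since Python 3.3, also accepts formal aliases in `lookup`. The sketch below uses U+FE18, whose official name contains a well-known misspelling, as an illustrative example chosen here.

```python
import unicodedata

ch = "\uFE18"

# The official name is immutable, so the misspelling is preserved.
official = unicodedata.name(ch)

# The formal alias carries the corrected spelling.
corrected = "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"

# lookup() resolves both the official name and the alias to the same character.
via_official = unicodedata.lookup(official)
via_alias = unicodedata.lookup(corrected)
```

Both lookups return the same character, which is how an application can adopt the corrected spelling without breaking data that uses the original name.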
An example of this arises with the Korean alphabet Hangul: Unicode provides a mechanism for composing Hangul syllables from their individual Hangul Jamo subcomponents. However, it also provides 11,172 combinations of precomposed syllables made from the most common jamo.
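The two representations are related by normalization. As a small sketch, assuming the sample syllable U+D55C (chosen here for illustration), Python's `unicodedata.normalize` converts between the precomposed form and its jamo sequence; the constant names follow the conventional leading/vowel/trailing breakdown.

```python
import unicodedata

syllable = "\uD55C"  # a precomposed Hangul syllable

# Canonical decomposition (NFD) yields the individual conjoining jamo.
jamo = unicodedata.normalize("NFD", syllable)

# Canonical composition (NFC) rebuilds the precomposed syllable.
recomposed = unicodedata.normalize("NFC", jamo)

# The precomposed block is the full product of leading consonants, vowels,
# and trailing consonants (the trailing count includes "no trailing consonant").
LEADING, VOWELS, TRAILING = 19, 21, 28
precomposed_total = LEADING * VOWELS * TRAILING
```

The decomposition of this syllable is three code points long, and the product of the three jamo counts is 11,172.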
CJK characters presently only have codes for uncomposable radicals and precomposed forms. Most Han characters have either been intentionally composed from, or reconstructed as compositions of, simpler orthographic elements called radicals, so in principle Unicode could have enabled their composition as it did with Hangul. While this could have greatly reduced the number of required code points, as well as allowing the algorithmic synthesis of many arbitrary new characters, the complexities of character etymologies and the post-hoc nature of radical systems add immense complexity to the proposal. Indeed, attempts to design CJK encodings on the basis of composing radicals have been met with difficulties resulting from the reality that Chinese characters do not decompose as simply or as regularly as Hangul does.
The CJK Radicals Supplement block is assigned to the range U+2E80–U+2EFF, and the Kangxi radicals are assigned to U+2F00–U+2FDF. The Ideographic Description Sequences block covers the range U+2FF0–U+2FFF, but The Unicode Standard warns against using its characters as an alternate representation for characters encoded elsewhere.
Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs. Real stacking is impossible but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally, this approach is only effective in monospaced fonts but may be used as a fallback rendering method when more complex methods fail.
The standard DIN 91379 specifies a subset of Unicode letters, special characters, and sequences of letters and diacritic signs to allow the correct representation of names and to simplify data exchange in Europe. This standard supports all of the official languages of all European Union countries, as well as the German minority languages and the official languages of Iceland, Liechtenstein, Norway, and Switzerland. To allow the transliteration of names in other writing systems to the Latin script according to the relevant ISO standards, all necessary combinations of base letters and diacritic signs are provided.
A0–FF | Latin-1 Supplement (80–FF) |
8F, 92, B7, DE-EF, FA–FF | Latin Extended-B (80–FF ...) |
59, 7C, 92 | IPA Extensions (50–AF) |
BB–BD, C6, C7, C9, D6, D8–DB, DC, DD, DF, EE | Spacing Modifier Letters (B0–FF) |
7F, 82 | Superscripts and Subscripts (70–9F) |
A3–A4, A7, AC, AF | Currency Symbols (A0–CF) |
5B–5E | Number Forms (50–8F) |
90–93, 94–95, A8 | Arrows (90–FF) |
80, 84, 88, 8C, 90–93 | Block Elements (80–9F) |
A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6 | Geometric Shapes (A0–FF) |
Rendering software that cannot process a Unicode character appropriately often displays it as an open rectangle, or as U+FFFD (the replacement character), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters. Apple's Last Resort font will display a substitute glyph indicating the Unicode range of the character, and SIL International's Unicode fallback font will display a box showing the hexadecimal scalar value of the character.
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Coded Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code units. All UTF encodings map code points to a unique sequence of bytes. The numbers in the names of the encodings indicate the number of bits per code unit (for UTF encodings) or the number of bytes per code unit (for UCS encodings and UTF-1). UTF-8 and UTF-16 are the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
UTF encodings include:
UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for the interchange of Unicode text. It is used by FreeBSD and most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
The UCS-2 and UTF-16 encodings specify the Unicode byte order mark (BOM) for use at the beginnings of text files, which may be used for byte-order detection (or endianness detection). The BOM, encoded as U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width non-break space (a character with no appearance and no effect other than preventing the formation of ligatures).
The same character converted to UTF-8 becomes the byte sequence EF BB BF. The Unicode Standard allows that the BOM "can serve as a signature for UTF-8 encoded text where the character set is unmarked".
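A minimal sketch of BOM-based detection in Python follows. The function `sniff_bom` is a hypothetical helper written for this illustration; the `codecs.BOM_*` constants are part of the standard library. Real-world detection usually combines this with other signals, since most UTF-8 text carries no BOM at all.

```python
import codecs
from typing import Optional

# The BOM character U+FEFF serialized in several encoding forms.
bom = "\uFEFF"
utf8_bom = bom.encode("utf-8")
utf16_be_bom = bom.encode("utf-16-be")
utf16_le_bom = bom.encode("utf-16-le")  # reads as U+FFFE if taken as big-endian


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess an encoding from a leading BOM; return None if there is none.

    UTF-32 is tested before UTF-16 because the UTF-32 little-endian BOM
    (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
    """
    candidates = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for signature, encoding in candidates:
        if data.startswith(signature):
            return encoding
    return None
```

One known limitation of this ordering is that UTF-16 little-endian text beginning with a BOM followed by U+0000 would be reported as UTF-32; such input is rare in practice.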
In UTF-32 and UCS-4, one 32-bit code unit serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code unit manifests as a byte sequence). In the other encodings, each code point may be represented by a variable number of code units. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system that uses the gcc compilers to generate software uses it as the standard "wide character" encoding. Some programming languages, such as Seed7, use UTF-32 as an internal representation for strings and characters. Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, effectively disseminating such encoding in high-level coded software.
Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System (DNS). The encoding is used as part of IDNA, which is a system enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6.
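Python ships `punycode` and `idna` codecs that make the transformation easy to observe. The sketch below uses a sample label chosen for illustration; note that Python's built-in `idna` codec implements the older IDNA 2003 rules, so it should be read as a demonstration of the idea rather than of current registry practice.

```python
# A sample label containing one non-ASCII letter (U+00FC).
label = "b\u00FCcher"

# Raw Punycode: the ASCII letters, a delimiter, then the encoded insertions.
puny = label.encode("punycode")

# The IDNA layer applies Punycode per label and adds the "xn--" prefix.
domain = "b\u00FCcher.example".encode("idna")

# Decoding restores the original Unicode form.
round_trip = domain.decode("idna")
```

The resulting `domain` value contains only ASCII letters, digits, hyphens, and dots, which is what lets it travel through the existing DNS unchanged.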
GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.
All internet protocols maintained by the IETF, e.g. FTP, have required support for UTF-8 since the publication of RFC 2277 in 1998, which specified that all IETF protocols "MUST be able to use the UTF-8 charset".
UTF-8 (originally developed for Plan 9) has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets. UTF-8 is also the most common Unicode encoding used in HTML documents on the World Wide Web.
Multilingual text-rendering engines which use Unicode include Uniscribe and DirectWrite for Microsoft Windows, ATSUI and Core Text for macOS, and Pango for GTK+ and the GNOME desktop.
ISO/IEC 14755, which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the Basic method, where a beginning sequence is followed by the hexadecimal representation of the code point and the ending sequence. There is also a screen-selection entry method specified, where the characters are listed in a table on a screen, such as with a character map program.
Online tools for finding the code point for a known character include Unicode Lookup by Jonathan Hedley and Shapecatcher by Benjamin Milde. In Unicode Lookup, one enters a search key (e.g. "fractions"), and a list of corresponding characters with their code points is returned. In Shapecatcher, based on Shape context, one draws the character in a box and a list of characters approximating the drawing, with their code points, is returned.
The IETF has defined a framework for internationalized email using UTF-8, and has updated several protocols in accordance with that framework.
The adoption of Unicode in email has been very slow. Some East Asian text is still encoded in encodings such as ISO-2022, and some devices, such as mobile phones, still cannot correctly handle Unicode data. Support has been improving, however. Many major free mail providers such as Yahoo! Mail, Gmail, and Outlook.com support it.
Although syntax rules may affect the order in which characters are allowed to appear, XML (including XHTML) documents, by definition, comprise characters from most of the Unicode code points, with the exception of:
HTML characters manifest either directly as bytes according to the document's encoding, if the encoding supports them, or as numeric character references based on the character's Unicode code point. For example, the references &#916;, &#1049;, &#1511;, &#1605;, &#3671;, &#12354;, &#21494;, &#33865;, and &#47568; (or the same numeric values expressed in hexadecimal, with &#x as the prefix) should display on all browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
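The relationship between a character and its numeric character reference is mechanical, as the following Python sketch suggests. The helper names `to_decimal_ncr` and `to_hex_ncr` are hypothetical; `html.unescape` from the standard library performs the reverse conversion.

```python
import html


def to_decimal_ncr(text: str) -> str:
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_hex_ncr(text: str) -> str:
    """Write every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


decimal_form = to_decimal_ncr("Δ叶")          # "&#916;&#21494;"
hex_form = to_hex_ncr("Δ叶")                  # "&#x394;&#x53F6;"
round_trip = html.unescape(decimal_form)      # back to "Δ叶"
```

In practice only characters that the document's encoding cannot represent need to be written this way.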
When specifying URIs, for example as URLs in HTTP requests, non-ASCII characters must be percent-encoded.
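A minimal Python sketch of percent-encoding, using `urllib.parse.quote` from the standard library, which encodes non-ASCII characters as UTF-8 bytes by default. The wrapper name `percent_encode_path` and the sample path are illustrative assumptions.

```python
from urllib.parse import quote, unquote


def percent_encode_path(path: str) -> str:
    """Percent-encode a URI path, leaving '/' separators untouched."""
    return quote(path, safe="/")


encoded = percent_encode_path("/wiki/葉")  # "/wiki/%E8%91%89"
decoded = unquote(encoded)                 # "/wiki/葉"
```

Each non-ASCII character becomes one `%XX` triplet per byte of its UTF-8 form, so 葉 (U+8449) expands to three triplets.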
Free and retail fonts based on Unicode are widely available, since TrueType and OpenType support Unicode (and the Web Open Font Format, in its WOFF and WOFF2 versions, is based on those). These font formats map Unicode code points to glyphs, but OpenType and TrueType font files are restricted to 65,535 glyphs. Collection files provide a "gap mode" mechanism for overcoming this limit in a single font file. (Each font within the collection still has the 65,535 limit, however.) A TrueType Collection file would typically have a file extension of ".ttc".
Thousands of fonts exist on the market, but fewer than a dozen fonts—sometimes described as "pan-Unicode" fonts—attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in regard to obtaining glyph information from separate font files as needed, i.e., font substitution. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of diminishing returns for most typefaces.
For newlines, Unicode introduced U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR. This was an attempt to provide a Unicode solution to encoding paragraphs and lines semantically, potentially replacing all of the various platform solutions. In doing so, Unicode does provide a way around the historical platform-dependent solutions. Nonetheless, few if any systems have adopted these Unicode line and paragraph separators as the sole canonical line-ending characters. A common approach to this issue is newline normalization, as implemented by the Cocoa text system in Mac OS X and by the W3C XML and HTML recommendations. In this approach, every possible newline character is converted internally to a common newline (which one does not really matter, since it is an internal operation just for rendering). In other words, the text system can correctly treat the character as a newline, regardless of the input's actual encoding.
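Newline normalization of this kind can be sketched in a few lines of Python. The set of characters treated as line breaks here (CR LF, LF, CR, VT, FF, NEL, U+2028 and U+2029) and the name `normalize_newlines` are assumptions made for illustration, not a description of how any particular text system is implemented.

```python
import re

# CR LF is listed first so that the pair is replaced as a single line break.
_LINE_BREAKS = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def normalize_newlines(text: str, newline: str = "\n") -> str:
    """Replace every recognised line break with one common newline."""
    return _LINE_BREAKS.sub(newline, text)


mixed = "one\r\ntwo\u2028three\u2029four\rfive"
lines = normalize_newlines(mixed).split("\n")
```

After normalization, the rest of the program only has to recognise a single newline character, whichever convention the input used.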
Unicode has been criticized for failing to separately encode older and alternative forms of kanji which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names. This is often due to the fact that Unicode encodes characters rather than glyphs (the visual representations of the basic character that often vary from one language to another). The unification of glyphs leads to the perception that the languages themselves, not just the basic character representation, are being merged. There have been several attempts to create alternative encodings that preserve the stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. An example of one is TRON (although it is not widely adopted in Japan, there are some users who need to handle historical Japanese text and favor it).
Although the repertoire of fewer than 21,000 Han characters in the earliest version of Unicode was largely limited to characters in common modern usage, Unicode now includes more than 97,000 Han characters, and work is continuing to add thousands more historic and dialectal characters used in China, Japan, Korea, Taiwan, and Vietnam.
Modern font technology provides a means to address the practical issue of needing to depict a unified Han character in terms of a collection of alternative glyph representations. The 'locl' OpenType feature allows a renderer to select a different glyph for a character based on the text locale. The Unicode variation sequences can also provide in-text annotation of desired glyph selection, but no such sequences for Han characters have been standardized.
Injective mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Lack of consistency in various mappings between earlier Japanese encodings such as Shift-JIS or EUC-JP and Unicode led to round-trip format conversion mismatches, particularly the mapping of the JIS X 0208 character '〜' (1-33, WAVE DASH), heavily used in legacy database data, to either U+FF5E FULLWIDTH TILDE (in Microsoft Windows) or U+301C WAVE DASH (other vendors). (See the AFII contribution about WAVE DASH.)
Some Japanese computer programmers objected to Unicode because it requires them to separate the use of U+005C REVERSE SOLIDUS (backslash) and U+00A5 YEN SIGN, which was mapped to 0x5C in JIS X 0201, and a lot of legacy code exists with this usage (ISO 646-* Problem, Section 4.4.3.5 of Introduction to I18n, Tomohiro KUBOTA, 2001). (This encoding also replaces tilde '~' 0x7E with macron '¯', now 0xAF.) The separation of these characters exists in ISO 8859-1, from long before Unicode.
Thai alphabet support has been criticized for its ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of phonetic order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the Thai Industrial Standard 620, which worked in the same way, and was the way in which Thai had always been written on keyboards. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation. Even if Unicode had adopted encoding according to spoken order, it would still be problematic to collate words in dictionary order. For example, the word แสดง ("perform") starts with a consonant cluster "สด" (with an inherent vowel for the consonant "ส"); the vowel แ-, in spoken order, would come after the ด, but in a dictionary, the word is collated as it is written, with the vowel following the ส.
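Visual-order storage can be seen by listing the code points of a Thai word in the order they are stored. The Python sketch below uses the word แสดง ("perform") as an assumed example and prints nothing; it only collects the character names from the standard `unicodedata` module.

```python
import unicodedata

# The word แสดง written as explicit code points, in stored (visual) order.
word = "\u0e41\u0e2a\u0e14\u0e07"

stored_order = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

# The first stored character is the vowel, although it is pronounced
# after the consonant cluster that follows it in the text.
first_code_point, first_name = stored_order[0]
```

Here the vowel SARA AE is stored first, ahead of the consonants SO SUA and DO DEK, which is why collation software must reorder such sequences before comparing them.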
While Unicode defines the script designator (name) to be "Phags_Pa", in that script's character names, a hyphen is added: U+A840 PHAGS-PA LETTER KA. This, however, is not an anomaly, but the rule: hyphens are replaced by underscores in script designators.
A security advisory was released in 2021 by two researchers, one from the University of Cambridge and the other from the University of Edinburgh, in which they assert that the BiDi marks can be used to make large sections of code do something different from what they appear to do. The problem was named "Trojan Source". In response, code editors started highlighting marks to indicate forced text-direction changes.
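The kind of highlighting that editors adopted can be approximated by scanning source text for bidirectional control characters. The Python sketch below is an assumption about one reasonable check, not the method used by any particular editor; it flags the embedding and override controls (U+202A to U+202E) and the isolate controls (U+2066 to U+2069), and the name `find_bidi_controls` is hypothetical.

```python
import unicodedata

# Explicit embedding/override controls and isolate controls.
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(source: str) -> list[tuple[int, str]]:
    """Return (index, character name) for each BiDi control in the text."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(source)
        if ch in BIDI_CONTROLS
    ]


suspicious = "access = 'user\u202e'"
findings = find_bidi_controls(suspicious)
```

A code review tool could reject or visibly mark any file for which such a scan returns a non-empty list.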