{{LangSwitch}}
 
[[Category:QtInternals]]
 
''Written by: Girish Ramakrishnan, ForwardBias Technologies''
 
= Basics of String Encoding =


Encoding standards specify how strings are represented in memory. For example, the ASCII encoding standard specifies the encoding for all characters of the English language ([http://www.asciitable.com/ Ascii Table] has a nice chart). It specifies that 'Q' is represented as 0x51, 't' as 0x74 and so on. ASCII defines only 128 characters and, since one needs only 7 bits to encode the range 0-127, it is referred to as a 7-bit encoding.

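As a minimal illustration (plain C++, no Qt dependencies), the snippet below prints the ASCII code points of the characters in "Qt":

<code>
#include <cstdio>

int main()
{
    // In ASCII, each character occupies one byte whose numeric value
    // is the character's code point.
    const char *s = "Qt";
    for (const char *p = s; *p; ++p)
        printf("'%c' -> 0x%02X\n", *p, (unsigned char)*p); // 'Q' -> 0x51, 't' -> 0x74
    return 0;
}
</code>
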
The ISO-8859-1 or Latin-1 encoding is an 8-bit encoding that is ASCII-compatible because it defines the exact same characters as ASCII for 0-127. The range 128-255 consists mostly of characters of Western European languages (Dutch, Spanish, etc.).

Encodings like ASCII and Latin-1 are inadequate to represent characters of other languages, say, Tamil. So, in the absence of standards, organizations made their own standards and specified their own characters for the range 128-255. For example, there is [http://en.wikipedia.org/wiki/Tamil_Script_Code_for_Information_Interchange TSCII] for Tamil, which defines Tamil characters in the 128-255 range.


== Code point, Code page, Character set, Encoding ==


The numbers defined by the encoding tables are called ''code points''. The tables themselves are called ''code pages''. To be technically correct, one must say that code point 0x51 represents the character 'Q' and code point 0x74 represents the character 't' in the ASCII code page, and so on.

Note that a code point is just a number; it doesn't specify how that number is stored in memory. An encoding scheme defines how code points are represented in memory. For example, one encoding scheme could represent a code point as a 32-bit integer with its high byte (MSB) first. Another could use a variable-length byte encoding, where frequently used code points are encoded as 2 bytes to conserve space and other code points as 3 bytes.

The set of characters in a code page is called the ''character set''.

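To make the distinction concrete, here is a small sketch using QString (which stores text as UTF-16 internally): the single code point U+00E9 (é) occupies a different number of bytes under different encoding schemes.

<code>
#include <QString>
#include <QByteArray>
#include <QDebug>

int main()
{
    // One code point, U+00E9 (é), created here from its UTF-8 byte sequence.
    QString s = QString::fromUtf8("\xC3\xA9");

    qDebug() << s.length();          // 1 UTF-16 code unit inside QString
    qDebug() << s.toUtf8().size();   // 2 bytes when encoded as UTF-8
    qDebug() << s.toLatin1().size(); // 1 byte (0xE9) when encoded as Latin-1
    return 0;
}
</code>
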
Note that the terms defined above were not very relevant for standards like ASCII, since such standards define the character set, the encoding, and the code page all in one. These terms, however, become important when discussing Unicode.


== Unicode ==


Encodings like TSCII and ASCII are meant to represent the characters of a single language and thus cannot be used to represent text composed in multiple languages. The Unicode standard attempts to define code points for characters in every language known to man. It even has code points for fictional languages like Klingon and for symbols that made sense to ancient civilizations, like the [http://en.wikipedia.org/wiki/Phaistos_Disc Phaistos Disc]. It provides a massive code page with over 70,000 defined characters. The first 256 code points are exactly the same as Latin-1. The U+<xxxx> notation is used to refer to code points in the Unicode table. For example, code point [http://www.fileformat.info/info/unicode/char/48/index.htm U+0048] refers to 'H'.
One way of categorizing the code points is to divide them into planes of 2^16 code points each. The first plane, from 0 to 0xFFFF, is called the Basic Multilingual Plane (BMP) and covers most characters in everyday use in modern languages.

The encoding of Unicode code points is defined by various encoding standards that are part of Unicode:

* UTF-8 encoding: UTF-8 uses a variable-length encoding scheme. All code points from U+0000 to U+007F are represented in a single byte using 7 bits, making it ASCII-compatible. For other code points, special lead bytes tell the decoder to interpret the bytes that follow as part of a multi-byte sequence forming a larger code point. Note that UTF-8 is not compatible with Latin-1.
* UTF-16 encoding: UTF-16 uses 16 bits per code unit. Since Unicode defines code points above 65535, not all characters fit in 16 bits, so UTF-16 is also variable-length: each code point takes either 2 or 4 bytes. To encode the higher code points, the Unicode standard reserves two ranges in the BMP - U+D800 to U+DBFF (termed high surrogates) and U+DC00 to U+DFFF (termed low surrogates). UTF-16 encodes a code point above 65535 as one high surrogate followed by one low surrogate; since the surrogates always appear in pairs, they are termed surrogate pairs (see the sketch after this list).
* UTF-32 encoding: uses 32 bits for each code point.
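
As a sketch of the surrogate mechanism, using Qt 5's QChar and QString helpers: the code point U+1D400 (MATHEMATICAL BOLD CAPITAL A) lies outside the BMP, so UTF-16 stores it as two code units.

<code>
#include <QChar>
#include <QString>
#include <QDebug>

int main()
{
    // U+1D400 is above 0xFFFF, so it cannot fit in a single UTF-16 code unit.
    const uint ucs4 = 0x1D400;
    const ushort high = QChar::highSurrogate(ucs4); // 0xD835
    const ushort low  = QChar::lowSurrogate(ucs4);  // 0xDC00

    QString s = QString::fromUcs4(&ucs4, 1);
    qDebug() << s.length();                // 2 UTF-16 code units
    qDebug() << QString::number(high, 16)  // "d835"
             << QString::number(low, 16);  // "dc00"

    // The pair decodes back to the original code point.
    qDebug() << (QChar::surrogateToUcs4(s.at(0), s.at(1)) == ucs4); // true
    return 0;
}
</code>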


== Byte order mark (BOM) ==


For multi-byte encodings (i.e. UTF-16 and UTF-32), one also needs to define the endianness. The standard defines a code point, U+FEFF, that can be injected at the beginning of text. The character itself is a zero-width no-break space, so it has no effect on text layout. The standard also states that U+FFFE (the byte-swapped form of FEFF) is an invalid character. This means that when the sequence is encountered at the start of text, a decoder can auto-detect the endianness. The code point U+FEFF used this way is called the Byte Order Mark, or BOM. In the absence of a BOM, Unicode assumes that the string is big-endian.

In the absence of a BOM, the byte order has to be specified explicitly for decoding. Explicit byte ordering can be stated by using the UTF-16LE (little-endian) or UTF-16BE (big-endian) decoders.
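
For example, with Qt 5's QTextCodec (a sketch; in Qt 6 the codec classes moved to the Qt5Compat module), the same two bytes decode to entirely different characters depending on the assumed byte order:

<code>
#include <QTextCodec>
#include <QByteArray>
#include <QDebug>

int main()
{
    // Two bytes meant to be U+0048 'H' in big-endian UTF-16; no BOM present.
    const QByteArray bytes("\x00\x48", 2);

    QTextCodec *be = QTextCodec::codecForName("UTF-16BE");
    QTextCodec *le = QTextCodec::codecForName("UTF-16LE");

    qDebug() << be->toUnicode(bytes); // "H" (U+0048)
    qDebug() << le->toUnicode(bytes); // U+4800, a CJK character - not what was meant
    return 0;
}
</code>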


== Composition and Normalization ==


Many languages have characters that are modifications of a base form. For example, in Hindi, ka, ki, ku and ke are variants of the base form 'k'. European languages apply accents to base characters like A and E. Unicode allows one to specify a base symbol followed by a combining symbol to form the final character. For example, A ([http://www.fileformat.info/info/unicode/char/41/index.htm U+0041]) followed by a combining grave accent ([http://www.fileformat.info/info/unicode/char/300/index.htm U+0300]) gives you [http://www.fileformat.info/info/unicode/char/C0/index.htm A WITH GRAVE]. This combining mechanism means that one can form such characters either by combining, or by using already-defined precomposed characters. Both forms exist in Unicode primarily for compatibility with other standards: it would have been easier to map a block of code points U+x-U+y onto some existing pre-Unicode standard.

Characters like [http://www.fileformat.info/info/unicode/char/C0/index.htm U+00C0] (À) are called precomposed (or composite, or decomposable) characters. These characters are well defined by the language (as opposed to meaningless symbols that can be formed by arbitrary composition). Unicode provides a way to decompose a composed character, i.e. from À to A plus the grave mark. This is called canonical decomposition, and it is a reversible process.

There also exist multiple code points that look the same. For example, Mathematical A (U+1D400) looks like the English letter 'A'. Unicode provides a way to convert (decompose) the mathematical A to the Latin A. Such conversions, together with the canonical decompositions, are called compatibility decompositions. Unlike canonical decomposition, compatibility decomposition is not necessarily reversible (the Latin A cannot be converted back to the mathematical A).

Unicode composition and decomposition matter because strings are compared byte by byte for equality and search operations. Given that there can be two representations of the same character, byte-by-byte comparisons might not give the expected results: if the user searches for the English letter 'A' in a GUI, the mathematical A will not be found (though to the user they look the same).

The process of converting from one form to another is called normalization. Four normalization forms are defined:

* NFD - Canonical decomposition. À becomes A plus the grave symbol; the mathematical A does not become a Latin A.
* NFKD - Compatibility decomposition. À becomes A plus the grave symbol; the mathematical A becomes a Latin A.
* NFC - Canonical composition. First decompose using NFD, then compose.
* NFKC - Compatibility composition. First decompose using NFKD, then compose.

Applications decide how to normalize strings based on their use case.
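
As a small sketch with QString, whose normalized() function implements these forms: the precomposed and the decomposed spellings of À compare unequal until both are normalized to a common form.

<code>
#include <QString>
#include <QChar>
#include <QDebug>

int main()
{
    // "À" written two ways: precomposed U+00C0, and decomposed as
    // 'A' U+0041 followed by COMBINING GRAVE ACCENT U+0300.
    const QString composed(QChar(0x00C0));
    QString decomposed;
    decomposed += QChar(0x0041);
    decomposed += QChar(0x0300);

    qDebug() << (composed == decomposed); // false: byte by byte they differ

    // Normalizing both sides to the same form makes them compare equal.
    qDebug() << (composed.normalized(QString::NormalizationForm_D)
                 == decomposed.normalized(QString::NormalizationForm_D)); // true
    qDebug() << (composed.normalized(QString::NormalizationForm_C)
                 == decomposed.normalized(QString::NormalizationForm_C)); // true
    return 0;
}
</code>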


== Detecting encoding ==


It is important to understand that interpreting a bunch of bytes in memory as a string requires knowledge of its encoding scheme. It is '''not possible''' to reliably determine the encoding of an arbitrary array of bytes. The encoding needs to be known from some external source to interpret the bytes correctly.

To reiterate, C-style strings (char *) have no encoding information associated with them. They are simply pointers to bytes whose interpretation depends on the encoding.
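
The following sketch shows why: the same byte sequence yields different text depending on which encoding the reader assumes.

<code>
#include <QString>
#include <QDebug>

int main()
{
    // Two bytes, 0xC3 0xA9, with no encoding information attached.
    const char bytes[] = { char(0xC3), char(0xA9), 0 };

    qDebug() << QString::fromUtf8(bytes);   // "é"  - one code point, U+00E9
    qDebug() << QString::fromLatin1(bytes); // "Ã©" - two code points, U+00C3 U+00A9
    return 0;
}
</code>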


== More reading ==


* [[QString]] - How Qt implements Unicode
* [http://www.cs.tut.fi/~jkorpela/chars.html Tutorial on character codes] - An excellent and thorough tutorial on character codes
* [http://www.joelonsoftware.com/articles/Unicode.html Unicode tutorial] - A lightweight tutorial on Unicode
* [http://techessence.info/node/60 UTF-8 and Latin-1: The Leaning Tower of Babel] - On the incompatibility of UTF-8 and Latin-1
* [http://www.gobosoft.com/eiffel/gobo/string/unicode/about.html Unicode Normalization] - Normalization forms in Unicode
