Basics of String Encoding

Written By : Girish Ramakrishnan, ForwardBias Technologies

Encoding standards specify how strings are represented in memory. For example, the ASCII encoding standard specifies the encoding for all characters of the English language (Ascii Table has a nice graph). It specifies that a 'Q' is represented as 0x51, 't' as 0x74 and so on. The ASCII only defines 127 characters and since one needs only 7-bits to encode 0-127, it is referred as a 7-bit encoding.

The ISO-8859-1 or the latin-1 encoding is a 8-bit encoding that is ASCII-compatible because it defines the exact same characters as ASCII for 0-127. The range 128-255 consists mostly of characters of Western European languages (Dutch, Spanish etc).

Encodings like ASCII and latin-1 are inadequate to represent characters of other languages, say, tamil. So, in the absence of standards, organizations have made their own standards and specified their own characters for the range 128-255. For example, there is a TSCII for tamil that defines tamil characters in the 128-255 range.

Code point, Code page, Character set, Encoding

The numbers defined by the encoding tables are called codepoints. The tables themselves are called code pages. To be technically correct, one must say codepoint 0x51 represents the character 'Q' and codepoint 0x74 represents the character 't' in the ASCII code page and so on.

Note that the code point is just a number and it doesn't specify how the number is stored in memory. An encoding scheme defines how code points are represented in memory. For example, an encoding scheme could represent the code points as a 32-bit integer with it's high byte (MSB) first. Another encoding scheme could encode code points using a variable length byte encoding scheme - frequently used code points are encoded as 2 bytes to conserve space and other code points as 3 bytes.

The characters in a code page is called the character set.

Note that the terms defined above were not very relevant for standards like ASCII since they define the character set, the encoding, code page all in one standard. These terms, however, become important when discussing Unicode.

Unicode

TSCII, ASCII are meant to represent the characters of a single language and thus cannot be used to represent text that is composed in multiple languages. The Unicode standard attempts to define code points for characters in every language known to man. It even has code points for fictional languages like klingon and symbols that made sense to ancient civilizations like Phaistos Disc. It provides a massive code page with over 70000 defined characters. The first 256 code points are the exact same as Latin-1. The U+<xxxx> notation is used to refer to code points in the unicode table. For example, codepoint U+0048 refers to 'H'.

One way of categorizing the code points is by dividing them into planes of 2^16 code points each. The first plane from 0 to 0xFFFF is called Basic Multilingual Plane (BMP) and covers all human languages.

The encoding of Unicode code points is defined by various encoding standards that are part of Unicode:

UTF-8 encoding: UTF-8 uses a variable length encoding scheme. All characters from U+0000 to U+007F are represented in a single byte using 7-bits making it ASCII-compatible. For other code points, a special escape character makes the decoder interpret the bytes that follow to form a code point larger than 8-bits. Note that the UTF-8 is not compatible with Latin-1.
UTF-16 encoding: UTF-16 uses 16-bits for each code point. Since Unicode defines code points > 65535, all characters cannot be accommodated in 16-bits. So, it uses a variable bit encoding scheme to map higher code points into 4 bytes (it's either 2 bytes or 4-bytes). To encode higher code points, the Unicode standard reserves some code points in BMP - U+D800 to U+DB7F (termed High Surrogates) and U+DC00 to U+DFFF (termed Low Surrogates). UTF-16 encodes code points > 65535 using one low surrogate and one high surrogate. Since, the surrogates always appear in pairs they are termed surrogate pairs.
UTF-32 : Uses 32-bits for each code point.

Byte order mark (BOM)

For multi-byte encodings (i.e UTF-16 and UTF-32), one needs to also define the endian-ness. The standard defines a sequence U+FEFF that can be injected at the beginning of text. The character itself means 'character of 0-width space', so it has no effect on text layout. The standard also states that U+FFFE (juxtaposition of FEFF) is an invalid character. This means that when a sequence U+FEFF is encountered, one can auto-detect the endianness. The code point U+FEFF is called the Byte Order Mark or BOM. In the absence of the BOM, Unicode assumes that the string is big-endian.

In the absence of a BOM, the byte order has to be specified explicitly for decoding. Explicit byte ordering can be stated by using the UTF-16LE (little endian), UTF-16BE (big endian) decoders.

Composition and Normalization

Many languages allow characters that are a modification of the base form. For example, in hindi, ka, ki, ku, ke are variants of the base form 'k'. European languages allow the accent to be applied to the base characters A, E. Unicode allows one to specify the "base symbol" followed by the "combining symbol" to form the final character. For example, A U+0041 followed by Combining Grave U+300 gives you A WITH GRAVE. This combining mechanism means that one can form characters either by combining or use the already defined characters. Both forms exist in Unicode primarily for compatibility with other standards. For example, it would have been easier to map the code point block U+x-U+y into some other existing pre-unicode standard.

Characters like U+300 are called precomposed or composite or decomposable characters. These characters are well defined by the language (as opposed to meaningless symbols that can be formed using arbitrary composition). Unicode provides a way to decompose the composed characters i.e from À to A and the grave mark. This is called canonical decomposition. This is a reversible process.

There also exists multiple code points that look the same. For example, Mathematical A U+1D400 looks like the English character 'A'. Unicode provides a way to convert (decompose) the mathematical A to Latin A. Such conversions and canonical decompositions are together called compatibility decompositions. Unlike canonical decomposition, compatibility decomposition is not necessarily reversible (i.e latin A cannot be reversed back to mathematical A).

Unicode composition and decomposition matters because strings are compared byte by byte for equality and search operations. Given that there can be two representations of the same character, byte by byte comparisons might not provide expected results. If the user searches for the English alphabet 'A' in a GUI, the mathematical A will not be found (though from a user's point of view they look similar).

The process of converting from one form to another is called normalization. There are 4 conversions possible

NFD - Canonical form decomposition. À will become A and grave symbol. Mathematical A won't become Latin A.
NFKD - Compatibility form decomposition. À will become A and grave symbol. Mathematical A will become Latin A.
NFC - Canonical form composition. First decompose using NFD. Then compose.
NFKC - Canonical form compatibility composition. First decompose using NFKD. Then compose.

Applications decide how to normalize strings based on their use case.

Detecting encoding

It is important to understand that interpreting a bunch of bytes in memory as a string requires the knowledge of it's encoding scheme. It is not possible to accurately determine the encoding for an array of bytes. The encoding of bytes needs to be known from some external source to correctly interpret them.

To reiterate, C-style strings (char *) have no encoding information associated with them. They are simply pointers to bytes whose interpretation depends on the encoding.

Code point, Code page, Character set, Encoding

Unicode

Byte order mark (BOM)

Composition and Normalization

Detecting encoding

More reading