Basics of Locales

Written By : Girish Ramakrishnan, ForwardBias Technologies

Language id

A language id identifies a human language and its written form. It comprises a language code and, optionally, one or more subtags that narrow the language down to a specific dialect or variation. For example, en-IN refers to the variation of English spoken in India. BCP 47 specifies the best current practice for specifying language ids. Per the document, the two- or three-letter language code is to be picked from ISO 639-1 or ISO 639-2. The subtag can be a country code picked from ISO 3166-1. The language code and subtags have to be separated using hyphens.
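As a rough illustration of this structure, the following Python sketch splits a language id at its hyphens. The function name parse_language_id is hypothetical, and the sketch only separates the pieces; it does not validate them against ISO 639 or ISO 3166-1.

```python
def parse_language_id(language_id):
    """Split a language id such as "en-IN" into (language, subtags).

    Illustrative only: the pieces are not checked against ISO 639 or ISO 3166-1.
    """
    language, *subtags = language_id.split("-")
    if not language:
        raise ValueError("language id must start with a language code")
    return language.lower(), subtags


print(parse_language_id("en-IN"))  # ('en', ['IN'])
print(parse_language_id("sv"))     # ('sv', [])
```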

Locale id

A locale id identifies a set of user preferences that help software present data such as numbers, currency symbols, date and time formats, translated text and sort order. Whereas the purpose of a language id is to specify a language or dialect, the purpose of a locale id is to also provide the cultural context.

Representation of locale ids is operating system/library dependent. As an example, following the CLDR specification, the locale id "en-IN-u-rg-gbzzzz" specifies the English language variant spoken in India (en-IN) with regional preferences for a person living in Great Britain. Glibc specifies locale ids using the "languagecode_countrycode.charset@modifier" format.
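To make the glibc format concrete, here is a minimal Python sketch that takes such an id apart with a regular expression. The function name parse_glibc_locale_id and the pattern are assumptions made for illustration; the sketch treats every part after the language code as optional and gives no special handling to names like "C" or "POSIX".

```python
import re

# languagecode[_countrycode][.charset][@modifier]; only the language code is required here.
_LOCALE_ID = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<country>[A-Za-z0-9]+))?"
    r"(?:\.(?P<charset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_glibc_locale_id(locale_id):
    """Return the parts of a glibc-style locale id as a dict (missing parts are None)."""
    match = _LOCALE_ID.match(locale_id)
    if match is None:
        raise ValueError(f"not a glibc-style locale id: {locale_id!r}")
    return match.groupdict()


print(parse_glibc_locale_id("en_IN.utf8"))
print(parse_glibc_locale_id("de_DE@euro"))
```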

Locales

A locale id serves to represent the customs and notations of a certain group of people. The term locale is used to designate the customs and notations of a specific user. For example, a user can prefer the application to be translated into Swedish but prefer a 24-hour format for time and expect the thousands separator in numbers to be "," as opposed to ".".

Locale ids are thus provided for each "category" - numbers, currency symbols, collation order etc. All these preferences put together form a user's locale.
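The idea of per-category preferences can be sketched in Python as follows. Everything here is hypothetical: the user_locale dictionary, its keys and the format_number helper are invented for illustration and do not correspond to any real locale database.

```python
def format_number(value, thousands_sep, decimal_sep, digits=2):
    """Format value using the given separators, as a numbers category might dictate."""
    # Start from Python's fixed "," / "." layout, then swap in the requested separators.
    text = f"{value:,.{digits}f}"
    return (
        text.replace(",", "\x00")
        .replace(".", decimal_sep)
        .replace("\x00", thousands_sep)
    )


# A user's locale sketched as one set of preferences per category (hypothetical data).
user_locale = {
    "messages": "sv",                                        # translations in Swedish
    "time": {"24_hour": True},                               # 24-hour clock
    "numeric": {"thousands_sep": ",", "decimal_sep": "."},  # "," as thousands separator
}

numeric = user_locale["numeric"]
print(format_number(1234567.891, numeric["thousands_sep"], numeric["decimal_sep"]))  # 1,234,567.89
print(format_number(1234567.891, ".", ","))                                          # 1.234.567,89
```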

The "C" locale

Many standard C library functions are locale aware - strtod, scanf, etc. For example, they need the locale to determine whether "." or "," is the decimal separator. The ISO C standard defines a locale called the C locale, which provides neutral default settings for these functions to work with. Its conventions resemble those of US English locales, but it is not identical to them: for instance, it defines no thousands separator.

All programs use the C locale on startup. One needs to figure out the user's locale (discussed in the next section) and explicitly set it up in the application. A C program typically does both in one step by calling setlocale(LC_ALL, "").
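The conventions of the C locale can be inspected from Python, whose locale module wraps the C library's setlocale and localeconv. The sketch below selects the C locale explicitly so that the result does not depend on the environment; adopt_user_locale is a hypothetical helper name.

```python
import locale

# Select the "C" locale explicitly so the result does not depend on the environment.
locale.setlocale(locale.LC_ALL, "C")
conventions = locale.localeconv()
print(repr(conventions["decimal_point"]))  # '.'
print(repr(conventions["thousands_sep"]))  # ''


def adopt_user_locale():
    """Switch to the locale named by the environment; return its name, or None on failure."""
    try:
        # An empty string asks the C library to consult the environment.
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed; stay in the current one.
        return None
```

Calling adopt_user_locale() early in a program is one way to perform the explicit setup step; until it is called, number parsing and formatting keep following the C locale.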

Getting the locale information

On Linux, a program gets the locale information by reading various environment variables. How these environment variables are set up is distribution specific. The 'locale' program prints out the locale information in the environment. In brief, the LANG variable provides the locale id for all categories in a single shot. One can, however, specify different locales for specific categories by setting LC_xxx variables.
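The lookup can be sketched in Python as below. It assumes the usual POSIX precedence, which the text above does not spell out: LC_ALL overrides everything, then the category's own LC_xxx variable applies, then LANG. The function name locale_id_for and the fallback to "C" when nothing is set are assumptions of this sketch.

```python
import os

CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_MESSAGES", "LC_MONETARY", "LC_NUMERIC", "LC_TIME")


def locale_id_for(category, env=None):
    """Return the locale id that applies to one category, given environment variables."""
    env = os.environ if env is None else env
    # LC_ALL overrides everything, then the category's own variable, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # nothing set: assume the default "C" locale


example_env = {"LANG": "sv_SE.UTF-8", "LC_NUMERIC": "en_US.UTF-8"}
for category in CATEGORIES:
    print(category, "->", locale_id_for(category, example_env))
```

With example_env, every category resolves to sv_SE.UTF-8 except LC_NUMERIC, which resolves to en_US.UTF-8.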

On Windows, GetUserDefaultLCID and GetLocaleInfo can be used to get the locale information.

On Mac OS X, CFLocaleGetIdentifier, applied to the locale returned by CFLocaleCopyCurrent, will return the locale id.

Local 8-bit encoding

In the pre-Unicode world, countries, organizations and OEMs eager to support different languages extended the ASCII encoding to suit their purposes. In such schemes, the characters 0-127 were exactly those defined by ASCII. The characters 128-255 were defined to suit the OEM's needs. This meant that the interpretation of byte 250 in a file depended on the character set that was in use when the file was created. This character encoding information is referred to as the "local 8-bit encoding".
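The ambiguity of byte 250 can be demonstrated with a short Python sketch. The three encodings (Latin-1, code page 437 and code page 1251) are chosen only as examples of ASCII extensions, and interpret_byte is a hypothetical helper name. The output uses Unicode character names so that it stays plain ASCII.

```python
import unicodedata


def interpret_byte(value, encoding):
    """Decode one byte under the given 8-bit encoding; return (code point, Unicode name)."""
    char = bytes([value]).decode(encoding)
    return ord(char), unicodedata.name(char)


for encoding in ("latin-1", "cp437", "cp1251"):
    code_point, name = interpret_byte(250, encoding)
    print(f"{encoding:8} U+{code_point:04X} {name}")
```

The same byte comes out as an accented Latin letter, a middle dot and a Cyrillic letter respectively, whereas any byte below 128 decodes identically under all three.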

The 8-bit encoding information is also used by devices like the console (cmd.exe) and terminals (konsole). These programs interpret the data sent to them based on the local 8-bit encoding.

On Linux, file names have no concept of character set. They are just a bunch of bytes which then get interpreted as strings based on the local 8-bit encoding (on Windows, NTFS encodes file names as UTF-16). This is also true for the contents of files: they are uninterpreted byte streams.
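A small Python sketch shows what this means for a file name. The byte string and the helper name decode_file_name are made up for illustration; the same bytes yield different strings depending on which local 8-bit encoding is assumed.

```python
def decode_file_name(raw_name, encoding):
    """Interpret the raw bytes of a file name under the given local 8-bit encoding."""
    # Undecodable bytes are replaced with U+FFFD rather than raising.
    return raw_name.decode(encoding, errors="replace")


raw_name = b"caf\xc3\xa9.txt"  # the bytes a UTF-8 system would store for "café.txt"

as_utf8 = decode_file_name(raw_name, "utf-8")
as_latin1 = decode_file_name(raw_name, "latin-1")
print(ascii(as_utf8))    # 'caf\xe9.txt'
print(ascii(as_latin1))  # 'caf\xc3\xa9.txt'
```

Read as UTF-8 the name has eight characters; read as Latin-1 it has nine, because the two bytes of the accented letter are taken as two separate characters. In Python, os.listdir called with a bytes path returns such raw names without decoding them.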

The local 8-bit encoding to be used is sometimes considered to be part of the locale information. For example, locale id "en_IN.utf8" identifies UTF-8 as the local encoding to be used. All file contents, file names and terminal input/output are then to be interpreted as UTF-8.

When printing on the console, one should never write UTF-8 (QString::toUtf8) unconditionally. One should instead write local 8-bit data (QString::toLocal8Bit), even though it is very likely that the local 8-bit encoding is UTF-8.
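The same principle can be sketched outside Qt. The Python example below is only loosely analogous to QString::toLocal8Bit: the helper names to_local_8bit and write_to_console are hypothetical, and replacing unrepresentable characters with "?" is a choice made for this sketch.

```python
import locale
import sys


def to_local_8bit(text, encoding=None):
    """Encode text in the local 8-bit encoding reported by the locale."""
    if encoding is None:
        # Ask the locale which encoding it prefers, without changing the current locale.
        encoding = locale.getpreferredencoding(False)
    # Characters the encoding cannot represent become "?" instead of raising.
    return text.encode(encoding, errors="replace")


def write_to_console(text):
    """Write text to standard output as local 8-bit bytes."""
    sys.stdout.buffer.write(to_local_8bit(text) + b"\n")
    sys.stdout.buffer.flush()


print(to_local_8bit("\u00fa", "latin-1"))  # b'\xfa'
print(to_local_8bit("\u00fa", "utf-8"))    # b'\xc3\xba'
print(to_local_8bit("\u00fa", "ascii"))    # b'?'
```

The explicit encoding argument exists only to make the three cases visible; write_to_console leaves it out so that the locale decides.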