QString

From Qt Wiki
Revision as of 11:54, 19 April 2019 by Vladeno

Written by Girish Ramakrishnan, ForwardBias Technologies

The fundamentals of encoding are covered in Basics of String Encoding. QString stores Unicode strings. Because it stores Unicode, a QString knows which characters its contents represent. This is in contrast to a C-style string (char*), which carries no knowledge of its own encoding. A QString can be rendered on the screen or sent to a printer, provided there is a font that can display the characters it holds. All user-visible strings in Qt are stored in QString.

Internally, QString stores the string in the UTF-16 encoding. Each 16-bit code unit of UTF-16 is represented by a QChar. One main reason for using UTF-16 as the internal representation is that strings can be passed quickly to the native Unicode APIs on Mac OS X and Windows, which expect UTF-16.
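As an illustration of what a UTF-16 code unit is, the following minimal sketch encodes a single code point into 16-bit units. It uses only the standard library rather than Qt, so std::u16string stands in for QString and char16_t for QChar; the function name encodeUtf16 is hypothetical and the input is assumed to be a valid code point.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid code point (at most 0x10FFFF, not a surrogate).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit holds the whole code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary plane: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

The euro sign U+20AC fits in one code unit, while a supplementary character such as U+1D11E needs two.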

For processing a C-style char-pointer or an array of bytes, QByteArray should be used instead of QString. See Using QByteArray for more details.

Using C-style strings with QString

QString string("Qt");

The above code is saved in a source file with some encoding, called the input charset. The compiler generates code that places the C-style string "Qt" in memory, possibly in a different encoding, called the exec charset. At run time, QString receives a pointer to this memory location and has to interpret the bytes and convert them to Unicode.

To convert the C-style string to Unicode, QString needs to know the exec charset. By default, Qt assumes that this is ASCII. Internally, this conversion uses the same code path as QString::fromAscii(). QString::fromAscii(), in turn, decodes the characters as Latin-1 (ASCII is a subset of Latin-1, so ASCII input decodes unchanged). It is thus possible to get away with placing Latin-1 characters in C-strings.
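Decoding Latin-1 is simple because every byte value is also the Unicode code point of the character. The sketch below shows the idea with the standard library only; decodeLatin1 is a hypothetical helper and not Qt's implementation.

```cpp
#include <string>

// Decode Latin-1 bytes to UTF-16: each byte value is the code point itself.
std::u16string decodeLatin1(const std::string &bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)            // unsigned so 0x80..0xFF stay positive
        out.push_back(static_cast<char16_t>(b));
    return out;
}
```

ASCII bytes decode to the same characters either way. Bytes that are really UTF-8 do not: the two bytes that encode "é" in UTF-8 (0xC3 0xA9) decode as two separate Latin-1 characters, which is why a non-ASCII exec charset has to be announced to Qt.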

QTextCodec::setCodecForCStrings(exec-charset) can be used to change the encoding that Qt uses to decode C-style strings. Calling this function makes QString::fromAscii() decode C-style strings using the new charset (in other words, it no longer assumes ASCII).

The only reason to use QTextCodec::setCodecForCStrings() is that the exec charset is not ASCII. A common case is a source file that contains non-ASCII characters. Such source files are saved as UTF-8 and the exec charset of the compiler is set to UTF-8. QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8")) can then be used to make Qt interpret all char* pointers correctly as UTF-8.

Even though QTextCodec::setCodecForCStrings() is a nice convenience, it is recommended to use only ASCII characters in source files. The C++ standard only mandates ASCII support and does not specify what encodings are to be supported by the compiler. A string may be initialized with the euro character (U+20AC) in any of the following ways:

QString euro1 = QString::fromUtf8("\u20AC"); // \u is the Unicode escape sequence defined by the C++ standard; the code point is encoded as UTF-8 when the exec charset is UTF-8
QString euro2 = QChar(0x20AC);
static const char utf8_euro[] = "\342\202\254"; // Euro symbol as UTF-8 bytes, written with octal escapes
QString euro3 = QString::fromUtf8(utf8_euro, sizeof(utf8_euro) - 1); // - 1 excludes the terminating NUL

All of the above techniques work with a source file that contains only ASCII characters.
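The octal escapes for the euro symbol can be checked by encoding U+20AC by hand. The following is a minimal sketch using only the standard library; encodeUtf8 is a hypothetical helper, not a Qt function, and it assumes a valid code point as input.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumption: cp is a valid code point (at most 0x10FFFF, not a surrogate).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

For U+20AC this yields the three bytes 0xE2 0x82 0xAC, which are 342, 202 and 254 in octal.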

Unicode methods in QString

A QChar represents a single UTF-16 code unit, which is a complete Unicode code point only for characters in the Basic Multilingual Plane. QString::unicode() returns a pointer to the QChars of a QString. QString::utf16() returns a const ushort *. Notice that the function is not named toUtf16(): no conversion is involved, because the internal representation of QString is already UTF-16.

QString::normalized() can be used for Unicode composition and decomposition.

A QChar is always 16 bits wide. A character outside the Basic Multilingual Plane is represented by a surrogate pair, that is, two QChars. QChar::isHighSurrogate() and QChar::isLowSurrogate() tell which half of a pair a given QChar is. QChar::unicode() returns the 16-bit value of the QChar. QChar::cell() and QChar::row() return the lower byte and the higher byte of the QChar, respectively.
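The arithmetic behind these accessors can be sketched in standard C++ as follows. The free functions below are hypothetical stand-ins that mirror the checks QChar performs; they are not Qt's own code, and combineSurrogates assumes it is given a valid high/low pair.

```cpp
#include <cstdint>

// Surrogate ranges defined by UTF-16.
bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a surrogate pair into the code point it encodes.
// Assumption: high and low really are a high and a low surrogate.
char32_t combineSurrogates(char16_t high, char16_t low)
{
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// Higher and lower byte of a 16-bit code unit (compare QChar::row() and QChar::cell()).
std::uint8_t rowOf(char16_t u)  { return static_cast<std::uint8_t>(u >> 8); }
std::uint8_t cellOf(char16_t u) { return static_cast<std::uint8_t>(u & 0xFF); }
```

For example, the pair 0xD834, 0xDD1E combines to U+1D11E, and the euro sign 0x20AC has row 0x20 and cell 0xAC.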

QString::length() returns the number of QChars. When the string contains supplementary characters (which take two QChars each), the length is therefore larger than the number of actual characters.
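The difference between code units and characters can be made concrete with a small counting routine. This is a sketch in standard C++, with std::u16string standing in for QString; countCodePoints is a hypothetical helper and assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-16. A low surrogate is always the
// second half of a pair, so it does not start a new character.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (char16_t u : s) {
        const bool isLowSurrogate = (u >= 0xDC00 && u <= 0xDFFF);
        if (!isLowSurrogate)
            ++n;
    }
    return n;
}
```

A string holding only the musical G clef U+1D11E has a size of 2 code units but contains 1 code point, which is the situation in which QString::length() and the character count disagree.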

QString::toUtf8(), QString::fromUtf8(), QString::toUcs4() and QString::fromUcs4() convert to and from UTF-8 and UTF-32.

Disabling QString(char *)

Even though the automatic conversion from a C-style string to QString is convenient, it is a frequent source of subtle bugs when using third-party libraries. Qt provides an option to disable the automatic conversion from C-style strings to QString. For example,

void gitCallback(const char *data)
{
    QString string = data; // compile error. makes the author think about encoding of 'data'
    // ...
}

Compile errors like the one above make the programmer reconsider whether QString is the right type (a QByteArray may be the better option) and work out the encoding of the C-style string.

Defining the macro QT_NO_CAST_FROM_ASCII disables the automatic conversion from C-strings to QString through QString::fromAscii(), so any such conversion results in a compile error. After adding the define, code that relies on the implicit conversion, such as a comparison against a string literal, has to name the encoding explicitly:

if (fruit == QString::fromUtf8("banana")) {  } // the encoding is stated explicitly