QString: Difference between revisions

Latest revision as of 11:54, 19 April 2019

Written by Girish Ramakrishnan, ForwardBias Technologies

The fundamentals of encoding are covered in Basics of String Encoding. QString stores Unicode strings. Because it stores Unicode, a QString knows what characters its contents represent. This is in contrast to a C-style string (char*), which carries no knowledge of its own encoding. A QString can be rendered on the screen or to a printer, provided there is a font to display the characters that the QString holds. All user-visible strings in Qt are stored in QString.

Internally, QString stores the string using the UTF-16 encoding. Each 16-bit (2-byte) UTF-16 code unit is represented by a QChar. One main reason to use UTF-16 as the internal representation is that it makes it fast to pass strings to the native Unicode APIs on Mac OS X and Windows (which expect UTF-16).

For processing a C-style char-pointer or an array of bytes, QByteArray should be used instead of QString. See Using QByteArray for more details.

Using C-style strings with QString

QString string("Qt");

The above code is saved in a file with some encoding, called the input charset. The compiler generates code that puts the C-style string "Qt" in memory, possibly in a different encoding, called the exec charset. At run time, QString gets a pointer to this memory location and needs to interpret the bytes and convert them to Unicode.

To convert the C-style string to Unicode, QString needs to know the exec charset. By default, Qt assumes that this is ASCII. Internally, this conversion uses the same code path as QString::fromAscii(). QString::fromAscii(), in turn, decodes the characters as Latin-1 (ASCII is a subset of Latin-1, so ASCII text decodes identically). It is thus possible to get away with placing Latin-1 characters in C-strings.
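The Latin-1 decoding step described above can be sketched in plain C++11 without Qt. This is only an illustration of the idea, not Qt's implementation: decodeLatin1 is a hypothetical helper, and std::u16string stands in for QString's UTF-16 storage. Each Latin-1 byte value is equal to its Unicode code point, so decoding amounts to widening every byte to 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt function): decode Latin-1 bytes to UTF-16.
// Every Latin-1 byte value equals its Unicode code point, so each byte
// simply widens to one 16-bit code unit.
std::u16string decodeLatin1(const char *bytes, std::size_t size)
{
    std::u16string out;
    out.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
        // Go through unsigned char so bytes >= 0x80 do not sign-extend.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(bytes[i])));
    }
    return out;
}

// Usage check: ASCII passes through unchanged; byte 0xE9 becomes U+00E9.
inline void checkDecodeLatin1()
{
    assert(decodeLatin1("Qt", 2) == u"Qt");
    assert(decodeLatin1("\xE9", 1) == std::u16string(1, char16_t(0x00E9)));
}
```

The same widening applied to UTF-8 bytes would give wrong results for anything outside ASCII, which is why the exec charset matters.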

QTextCodec::setCodecForCStrings(exec-charset) can be used to change the encoding that Qt uses to decode C-style strings. Calling this function makes QString::fromAscii() decode C-style strings using the new charset (in other words, it doesn't decode ASCII anymore).

The only reason to use QTextCodec::setCodecForCStrings() is when the exec charset is not ASCII. A common case where this occurs is when source files contain non-ASCII characters. Such source files are saved as UTF-8 and the exec charset of the compiler is set to UTF-8. QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8")) can then be used to make Qt interpret all the char* pointers correctly as UTF-8.

Even though QTextCodec::setCodecForCStrings() is a nice convenience, it is recommended to use only ASCII characters in source files. The C++ standard only mandates ASCII support and does not specify what encodings are to be supported by the compiler. A string may be initialized with the euro character (U+20AC) in any of the following ways:

QString euro1 = QString::fromUtf8("\u20AC"); // \u is the universal character name escape defined by the C++ standard; this assumes the compiler encodes the code point as UTF-8
QString euro2 = QChar(0x20AC);
static const char utf8_euro[] = "\342\202\254"; // Euro symbol as UTF-8 bytes, written with octal escapes
QString euro3 = QString::fromUtf8(utf8_euro, sizeof(utf8_euro) - 1); // - 1 excludes the terminating NUL

All the above techniques work with a source file that contains only ASCII characters.
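To see why the three octal escapes stand for the euro sign, the UTF-8 decoding can be worked through by hand. The following is a minimal sketch in plain C++11, not Qt code: decodeUtf8 is a hypothetical helper that assumes well-formed input and does no validation. It also shows that sizeof on a char array counts the terminating NUL, so the byte count passed to a decoder should normally be one less.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Qt function): decode well-formed UTF-8 into
// code points. No validation is performed; malformed input gives garbage.
std::u32string decodeUtf8(const char *bytes, std::size_t size)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Payload bits carried by the lead byte itself.
        char32_t cp = extra == 0 ? lead : lead & (0x3F >> extra);
        // Each continuation byte contributes its low 6 bits.
        for (int k = 0; k < extra && i + 1 < size; ++k) {
            ++i;
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
        ++i;
    }
    return out;
}

// Usage check with the byte sequence from the text.
inline void checkDecodeUtf8()
{
    static const char utf8_euro[] = "\342\202\254"; // bytes E2 82 AC plus NUL
    assert(sizeof(utf8_euro) == 4);
    const std::u32string euro = decodeUtf8(utf8_euro, sizeof(utf8_euro) - 1);
    assert(euro.size() == 1 && euro[0] == 0x20AC);
    // Passing the full sizeof also decodes the terminating NUL.
    assert(decodeUtf8(utf8_euro, sizeof(utf8_euro)).size() == 2);
}
```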

Unicode methods in QString

A QChar represents one 16-bit UTF-16 code unit, which is a complete Unicode code point for every character in the Basic Multilingual Plane. QString::unicode() returns the QChars of a QString. QString::utf16() returns const ushort *. Notice that the function is not named toUtf16(): no conversion is involved, since the internal representation of QString is already UTF-16.

QString::normalized() can be used for Unicode composition and decomposition.

A QChar is always 16 bits wide. Characters outside the Basic Multilingual Plane are represented as a surrogate pair of two QChars. QChar::isHighSurrogate() and QChar::isLowSurrogate() tell which half of a pair a given QChar is. QChar::unicode() returns the 16-bit value of the QChar. QChar::cell() and QChar::row() return its lower byte and higher byte respectively.
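The bit patterns behind these checks are fixed by UTF-16 itself, so they can be sketched without Qt. The free functions below are hypothetical stand-ins named after the QChar members mentioned above (surrogateToCodePoint is an invented name for illustration); they operate on a plain char16_t rather than a QChar.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical free functions mirroring the QChar members named in the text.
// High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
inline bool isHighSurrogate(char16_t u) { return (u & 0xFC00) == 0xD800; }
inline bool isLowSurrogate(char16_t u)  { return (u & 0xFC00) == 0xDC00; }

// Combine a high/low surrogate pair into the code point it encodes.
inline char32_t surrogateToCodePoint(char16_t high, char16_t low)
{
    return 0x10000 + ((char32_t(high) - 0xD800) << 10) + (char32_t(low) - 0xDC00);
}

// Higher and lower byte of a 16-bit code unit.
inline std::uint8_t row(char16_t u)  { return std::uint8_t(u >> 8); }
inline std::uint8_t cell(char16_t u) { return std::uint8_t(u & 0xFF); }

// Usage check: U+1D11E (musical G clef) is the pair D834 DD1E in UTF-16.
inline void checkSurrogates()
{
    assert(isHighSurrogate(0xD834) && isLowSurrogate(0xDD1E));
    assert(surrogateToCodePoint(0xD834, 0xDD1E) == 0x1D11E);
    assert(row(0x20AC) == 0x20 && cell(0x20AC) == 0xAC);
}
```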

QString::length() returns the number of QChars. When the string contains supplementary characters (which take a surrogate pair each), the length is therefore larger than the number of actual characters.
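The difference between the number of code units and the number of code points can be made concrete with a short sketch. It uses std::u16string as a stand-in for QString's storage, and countCodePoints is a hypothetical helper rather than a Qt function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in UTF-16 text by treating each
// valid high/low surrogate pair as a single character.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = (s[i] & 0xFC00) == 0xD800;
        const bool nextIsLow = i + 1 < s.size() && (s[i + 1] & 0xFC00) == 0xDC00;
        if (high && nextIsLow)
            ++i; // skip the low half; the pair counts once
        ++count;
    }
    return count;
}

// Usage check: 'a' followed by U+1D11E takes three code units but is
// only two code points.
inline void checkCountCodePoints()
{
    const std::u16string s = u"a\U0001D11E";
    assert(s.size() == 3);
    assert(countCodePoints(s) == 2);
}
```

Unpaired surrogates are counted as one character each in this sketch; real code would have to decide how to treat such malformed input.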

QString::toUtf8(), QString::fromUtf8(), QString::toUcs4(), QString::fromUcs4() help in UTF-8 and UTF-32 conversion.

Disabling QString(char *)

Even though the automatic conversion from C-style strings to QString is convenient, it is often the source of subtle bugs when using third-party libraries. Qt provides an option to disable automatic conversion from C-style strings to QString. For example,

void gitCallback(const char *data)
{
    QString string = data; // compile error. makes the author think about encoding of 'data'
    ….
}

Compile errors like the one above make the programmer reconsider whether QString is the right type (maybe a QByteArray is a better option) and work out the encoding of the C-style string.

Defining the macro QT_NO_CAST_FROM_ASCII disables the automatic conversion from C-strings to QString through QString::fromAscii(); any code relying on it then fails to compile. After adding the define, code that uses C-string literals with QString must name the encoding explicitly, for example

if (fruit == QString::fromUtf8("banana")) { … } // make explicit mention of encoding
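The general mechanism, a macro that compiles out the implicit constructor so that only named, encoding-aware constructors remain, can be sketched in plain C++11. Everything here is hypothetical and is not Qt's source: Text is a toy stand-in for a string class, and SKETCH_NO_CAST_FROM_ASCII plays the role of QT_NO_CAST_FROM_ASCII.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical switch; remove this line to allow the implicit conversion.
#define SKETCH_NO_CAST_FROM_ASCII

class Text // hypothetical stand-in for a Unicode string class
{
public:
#ifndef SKETCH_NO_CAST_FROM_ASCII
    Text(const char *s) : bytes_(s) {} // implicit: the encoding is guessed
#endif
    // Named constructor: the caller states the encoding explicitly.
    static Text fromUtf8(const char *s) { Text t; t.bytes_ = s; return t; }

    bool operator==(const Text &other) const { return bytes_ == other.bytes_; }

private:
    Text() {}
    std::string bytes_; // raw bytes only, to keep the sketch short
};

// With the switch defined, a bare C string no longer converts to Text.
static_assert(!std::is_convertible<const char *, Text>::value,
              "implicit conversion from const char * must be disabled");

// Usage check: comparisons must spell out the encoding on both sides.
inline void checkExplicitEncoding()
{
    const Text fruit = Text::fromUtf8("banana");
    assert(fruit == Text::fromUtf8("banana"));
    assert(!(fruit == Text::fromUtf8("apple")));
}
```

With the switch in place, an expression such as fruit == "banana" stops compiling, which is the point: the author is forced to say which encoding the literal is in.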

@@ Line 1: / Line 1: @@
-'''English''' [[QtQString Korean|한국어]]
+{{LangSwitch}}
 [[Category:QtInternals]]
+''Written by Girish Ramakrishnan, ForwardBias Technologies''
-[toc align_right="yes" depth="1"]
+The fundamentals of encoding are covered in [[Basics of String Encoding]].
+QString stores unicode strings. By definition, since QString stores unicode, a QString knows what characters it's contents represent. This is in contrast to a C-style string (char*) that has no knowledge of encoding by itself. A QString can be rendered on the screen or to a printer, provided there is a font to display the characters that the QString holds. All user-visible strings in Qt are stored in QString.
-Written By : Girish Ramakrishnan, ForwardBias Technologies
-= QString =
-The fundamentals of encoding are covered in "BasicsOfStringEncoding":http://developer.qt.nokia.com/wiki/BasicsOfStringEncoding.
-QString stores unicode strings. By definition, since QString stores unicode, a QString knows what characters it's contents represent. This is in contrast to a C-style string (char ''') that has no knowledge of encoding by itself. A QString can be rendered on the screen or to a printer, provided there is a font to display the characters that the QString holds. All user-visible strings in Qt are stored in QString.
 Internally, QString stores the string using the UTF-16 encoding. Each of the 2 bytes of UTF-16 is represented using a QChar. One main reason to use UTF-16 as the internal representation is that it makes it fast to use them with native unicode APIs' on the Mac OS X and Windows (which expect UTF-16).
-For processing a C-style char-pointer or an array of bytes, QByteArray should be used instead of QString. See "UsingQByteArray":http://developer.qt.nokia.com/wiki/UsingQByteArray for more details.
+For processing a C-style char-pointer or an array of bytes, QByteArray should be used instead of QString. See [[Using QByteArray]] for more details.
-h1. Using C-style strings with QString
+== Using C-style strings with QString ==
 <code>
- QString string("Qt");
+QString string("Qt");
 </code>
@@ Line 28: / Line 21: @@
 QTextCodec::setCodecForCStrings(exec-charset) can be used to change the encoding that Qt uses to decode C-style strings. Calling this function makes QString::fromAscii() decode C-style strings using the new charset (in other words, it doesn't decode ASCII anymore).
-The only reason to use QTextCodec::setCodecForCStrings is when the exec charset is not ASCII. A common case this occurs is when source files contain non-ASCII characters. Such source files are saved as UTF-8 and the exec charset of the compiler is set to UTF-8. QTextCodec::setCodecForCStrings("UTF-8") can then be used to make Qt interpret all the char''' pointers correctly as UTF-8.
+The only reason to use QTextCodec::setCodecForCStrings is when the exec charset is not ASCII. A common case this occurs is when source files contain non-ASCII characters. Such source files are saved as UTF-8 and the exec charset of the compiler is set to UTF-8. QTextCodec::setCodecForCStrings("UTF-8") can then be used to make Qt interpret all the char* pointers correctly as UTF-8.
 Even though QTextCodec::setCodecForCStrings() is a nice convenience, it is recommended to use only ASCII characters in source files. The C++ standard only mandates ASCII support and does not specify what encodings are to be supported by the compiler. A string may be initialized with the euro character (U+20AC) in any of the following ways:
 <code>
- QString euro1 = QString::fromUtf8("0AC"); // the eans Unicode sequence defined by c++ standard. ncodes the codepoint in UTF-8
+QString euro1 = QString::fromUtf8("0AC"); // the eans Unicode sequence defined by c++ standard. ncodes the codepoint in UTF-8
- QString euro2 = QChar(0x20AC);
+QString euro2 = QChar(0x20AC);
- static const char utf8_euro[] = "42\202\254"; // Euro symbol
+static const char utf8_euro[] = "\342\202\254"; // Euro symbol
- QString euro3 = QString::fromUtf8(utf8_euro, sizeof(utf8_euro));
+QString euro3 = QString::fromUtf8(utf8_euro, sizeof(utf8_euro));
 </code>
 All the above techniques require the source file to be only ASCII encoded.
-= Unicode methods in QString =
+== Unicode methods in QString ==
 A QChar represents a unicode code point. QString::unicode() returns the QChars of a QString. QString::utf16() returns ushort *. Notice that the function is '''not named''' toUtf16() because there is no conversion involved since the internal representation of QString is UTF-16.
@@ Line 52: / Line 45: @@
 QString::toUtf8(), QString::fromUtf8(), QString::toUcs4(), QString::fromUcs4() help in UTF-8 and UTF-32 conversion.
-= Disabling QString(char *) =
+== Disabling QString(char *) ==
 Even though the automatic conversion from C-style string to QString is convenient, it is often the source of many subtle bugs when using third party libraries. Qt provides an option of disabling automatic conversion from C-style strings to QString. For example,
 <code>
- void gitCallback(const char *data)
+void gitCallback(const char *data)
- {
+{
- QString string = data; // compile error. makes the author think about encoding of 'data'
+    QString string = data; // compile error. makes the author think about encoding of 'data'
- ….
+    ….
- }
+}
 </code>
@@ Line 67: / Line 60: @@
 By defining the macro QT_NO_CAST_FROM_ASCII, the automatic conversion from C-strings to QString using QString::fromAscii() is disabled and results in a compile error. After adding the define, the above code should be changed to
 <code>
- if (fruit== QString::fromUtf8("apple")) { … } // make explicit mention of encoding
+if (fruit==QString::fromUtf8("banana")) { … } // make explicit mention of encoding
 </code>
-= Further reading =