Qt-contributors-summit-2013-18836

From Qt Wiki
Revision from: 14:54, 8 Jul 2013

WIP

ICU and Qt

At QtCS 2012 it was decided to migrate to using ICU for all localization services, effectively making it a hard requirement for Qt5. Since then a number of technical and political objections have been raised which need to be addressed. This session is to discuss the problems and try to come up with a solution.

Current use of ICU

  • QtWebKit – required on all platforms for localization and text layout.
  • QtCore / QIcuCodec – private, optional
  • QtCore / QCollator – private, optional
  • QtCore / QLocale for toUpper() and toLower() – private, optional
  • sqlite3?

Proposed use of ICU

It was proposed to use ICU for all localization functions and data on all platforms, for both the default locale and any custom locales, and possibly for more Unicode code tables, etc. This would make ICU a hard requirement for all Qt5 builds.

Issues with ICU

Common

The ICU project is notoriously bad at building libraries that can be reliably linked against: it offers no BC guarantees for the C++ API, the sonames change between releases, the .dat files are tightly coupled to the library version, etc. For compatibility this leaves us with the C API, which lacks many of the advanced features in the C++ API that make using ICU desirable. Matching C API additions can be requested, and some have been added as a result, but the old system versions shipped in OSX and Linux won't have these available. Using the C++ API would require strictly controlling the version of ICU linked to, i.e. building it ourselves.
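Part of the reason binaries end up tied to one ICU version is that ICU, by default, appends its version to exported C symbol names through preprocessor renaming. The following is a minimal sketch of that technique in plain C++, not ICU's actual headers: `mylib_open`, the `MYLIB_*` macros and the `_51` suffix are hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical library "mylib" standing in for ICU; all names are illustrative.
#define MYLIB_VERSION_SUFFIX _51

// Two-step concatenation so MYLIB_VERSION_SUFFIX is expanded before pasting.
#define MYLIB_CONCAT_(a, b) a##b
#define MYLIB_CONCAT(a, b) MYLIB_CONCAT_(a, b)
#define MYLIB_ENTRY(name) MYLIB_CONCAT(name, MYLIB_VERSION_SUFFIX)

// Two-step stringification so the renamed symbol is what gets quoted.
#define MYLIB_STR_(x) #x
#define MYLIB_STR(x) MYLIB_STR_(x)

// Callers keep writing mylib_open(), but the compiler emits mylib_open_51.
#define mylib_open MYLIB_ENTRY(mylib_open)

// Defined under the versioned name because of the macro above.
// The body is a placeholder that just returns the length of the locale code.
extern "C" int mylib_open(const char *locale)
{
    assert(locale != nullptr);
    return static_cast<int>(std::strlen(locale));
}

// Returns the symbol name a linker would actually look for.
inline const char *linkedSymbolName()
{
    return MYLIB_STR(mylib_open);
}
```

A binary compiled against these headers references `mylib_open_51`, so a later library that exports `mylib_open_52` cannot satisfy it without a rebuild.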

ICU respects the user's locale code (e.g. en_GB) but doesn't use the system default settings or the user's custom settings (e.g. a different date format), so ICU apps may not fully fit in with a user's environment. This may especially be a problem on Windows, where settings and behaviour are very different from the POSIX world.

The default ICU data package is fairly large, up to 18MB, but not all of the data is actually required (code tables, for example), and it could easily be shrunk to about 4MB. ICU libraries and approximate sizes:

Library                                     Linux Filename   Linux Size
Data Library                                libicudata       17 MB
Common Library                              libicuuc         1.5 MB
i18n Library                                libicui18n       2 MB
I/O Library (optional)                      libicuio         55 KB
Font Layout Library (optional)              libicule         265 KB
Font Layout Extension Library (optional)    libiculx         50 KB
Tool Utility Library (optional)             libicutu         170 KB
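To make the totals explicit, the following sketch simply adds up the figures in the table. The struct and function names are made up for illustration, the sizes are the approximate values listed above rather than measurements, and 1 MB is taken as 1024 KB.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the size table; names here are illustrative only.
struct IcuLibrary {
    std::string fileName;
    double sizeKB;   // approximate size, 1 MB taken as 1024 KB
    bool optional;
};

// The rows of the table, copied by hand (not measured).
inline std::vector<IcuLibrary> icuLibraries()
{
    return {
        {"libicudata", 17 * 1024.0, false},
        {"libicuuc", 1.5 * 1024.0, false},
        {"libicui18n", 2 * 1024.0, false},
        {"libicuio", 55.0, true},
        {"libicule", 265.0, true},
        {"libiculx", 50.0, true},
        {"libicutu", 170.0, true},
    };
}

// Sums the sizes, skipping optional libraries unless includeOptional is set.
inline double totalSizeKB(const std::vector<IcuLibrary> &libs, bool includeOptional)
{
    double total = 0.0;
    for (const IcuLibrary &lib : libs) {
        assert(lib.sizeKB >= 0.0);
        if (includeOptional || !lib.optional)
            total += lib.sizeKB;
    }
    return total;
}
```

With these figures the three non-optional libraries come to 20992 KB, i.e. about 20.5 MB, of which the data library alone is 17 MB; the optional libraries add only another 540 KB.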

QtWebKit

WebKit provides an abstraction layer for localization, text layout and string encoding. WebKit ports such as QtWebKit can provide their own back-end, but apparently most choose to use the existing ICU back-end for convenience. QtWebKit4 apparently used QLocale/QString but switched to ICU for Qt5 (to be confirmed). This means QtWebKit needs and uses ICU on all platforms and so may not always fit in properly, whereas other system-provided WebKit ports may actually use the native API and so fit in better. We need to determine exactly what QtWebKit uses from ICU and whether the current approach is still best. Another issue is that using ICU for localization may give different results from those QLocale provides for the rest of the app.

WebKit / QtWebKit has 4 copies of the ICU headers included in its source tree, required to build on Mac 10.4.

WebKit / QtWebKit uses the ICU C++ API, meaning it is tied to the specific ICU version it is built against, so Linux distros must already be coping with the ICU upgrade situation, i.e. rebuilding QtWebKit and Chromium as required.

See https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/eSiHBND2rAQ/X9LyMHImNjgJ for an interesting discussion on using ICU in WebKit / Blink.

Build appears to only link against core and i18n libraries.

qtwebkit/Source/WebCore/platform/text/
LineBreakIteratorPoolICU.h
LocaleICU.h/.cpp
LocaleToScriptMappingICU.cpp
TextBreakIteratorICU.h/.cpp
TextCodecICU.h/.cpp
TextEncodingDetector.cpp

All ICU includes in QtWebKit, not all are built by QtWebKit:
Source/JavaScriptCore/runtime/DatePrototype.cpp:#include <unicode/udat.h>
Source/WebCore/editing/SmartReplaceICU.cpp:#include <unicode/uset.h>
Source/WebCore/editing/TextIterator.cpp:#include <unicode/usearch.h>
Source/WebCore/platform/graphics/SurrogatePairAwareTextIterator.cpp:#include <unicode/unorm.h>
Source/WebCore/platform/graphics/win/SimpleFontDataCGWin.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/win/SimpleFontDataCGWin.cpp:#include <unicode/unorm.h>
Source/WebCore/platform/graphics/blackberry/FontCacheBlackberry.cpp:#include <unicode/utf16.h>
Source/WebCore/platform/graphics/blackberry/skia/PlatformSupport.cpp:#include <unicode/utf16.h>
Source/WebCore/platform/graphics/freetype/SimpleFontDataFreeType.cpp:#include <unicode/normlzr.h>
Source/WebCore/platform/graphics/harfbuzz/ComplexTextControllerHarfBuzz.cpp:#include <unicode/normlzr.h>
Source/WebCore/platform/graphics/harfbuzz/ComplexTextControllerHarfBuzz.h:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/harfbuzz/HarfBuzzShaperBase.cpp:#include <unicode/normlzr.h>
Source/WebCore/platform/graphics/harfbuzz/HarfBuzzShaperBase.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/harfbuzz/ng/HarfBuzzShaper.cpp:#include <unicode/normlzr.h>
Source/WebCore/platform/graphics/harfbuzz/ng/HarfBuzzShaper.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/wx/SimpleFontDataWx.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/wx/SimpleFontDataWx.cpp:#include <unicode/unorm.h>
Source/WebCore/platform/graphics/wx/GlyphMapWx.cpp:#include <unicode/utf16.h>
Source/WebCore/platform/graphics/chromium/FontCacheChromiumWin.cpp:#include <unicode/uniset.h>
Source/WebCore/platform/graphics/chromium/UniscribeHelper.h:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/chromium/FontUtilsChromiumWin.cpp:#include <unicode/locid.h>
Source/WebCore/platform/graphics/chromium/FontUtilsChromiumWin.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/chromium/FontCacheAndroid.cpp:#include <unicode/locid.h>
Source/WebCore/platform/graphics/chromium/SimpleFontDataChromiumWin.cpp:#include <unicode/uchar.h>
Source/WebCore/platform/graphics/chromium/SimpleFontDataChromiumWin.cpp:#include <unicode/unorm.h>
Source/WebCore/platform/graphics/chromium/FontUtilsChromiumWin.h:#include <unicode/uscript.h>
Source/WebCore/platform/graphics/skia/SimpleFontDataSkia.cpp:#include <unicode/normlzr.h>
Source/WebCore/platform/graphics/skia/FontCacheSkia.cpp:#include <unicode/locid.h>
Source/WebCore/platform/text/TextEncoding.cpp:#include <unicode/unorm.h>
Source/WebCore/platform/text/LocaleToScriptMappingICU.cpp:#include <unicode/uloc.h>
Source/WebCore/platform/text/TextCodecICU.h:#include <unicode/utypes.h>
Source/WebCore/platform/text/LocaleICU.h:#include <unicode/udat.h>
Source/WebCore/platform/text/LocaleICU.h:#include <unicode/unum.h>
Source/WebCore/platform/text/LineBreakIteratorPoolICU.h:#include <unicode/ubrk.h>
Source/WebCore/platform/text/LocaleICU.cpp:#include <unicode/udatpg.h>
Source/WebCore/platform/text/LocaleICU.cpp:#include <unicode/uloc.h>
Source/WebCore/platform/text/TextCodecICU.cpp:#include <unicode/ucnv.h>
Source/WebCore/platform/text/TextCodecICU.cpp:#include <unicode/ucnv_cb.h>
Source/WebCore/platform/KURL.cpp:#include <unicode/uidna.h>
Source/WebKit/blackberry/Api/WebPage.cpp:#include <unicode/ustring.h> // platform ICU
Source/WebKit/chromium/public/WebSettings.h:#include <unicode/uscript.h>
Source/WebKit/chromium/src/linux/WebFontInfo.cpp:#include <unicode/utf16.h>
Source/WTF/wtf/url/src/URLCanonICU.cpp:#include <unicode/ucnv.h>
Source/WTF/wtf/url/src/URLCanonICU.cpp:#include <unicode/ucnv_cb.h>
Source/WTF/wtf/url/src/URLCanonICU.cpp:#include <unicode/uidna.h>
Source/WTF/wtf/unicode/icu/UnicodeIcu.h:#include <unicode/uchar.h>
Source/WTF/wtf/unicode/icu/UnicodeIcu.h:#include <unicode/uscript.h>
Source/WTF/wtf/unicode/icu/UnicodeIcu.h:#include <unicode/ustring.h>
Source/WTF/wtf/unicode/icu/UnicodeIcu.h:#include <unicode/utf16.h>
Source/WTF/wtf/unicode/icu/CollatorICU.cpp:#include <unicode/ucol.h>
Source/WTF/wtf/unicode/qt4/UnicodeQt4.h:#include <unicode/ubrk.h>

Windows

See http://thread.gmane.org/gmane.comp.lib.qt.devel/9226

ICU is not shipped with Windows so all apps need to build and distribute their own copy of ICU including data. Devs have complained that ICU is too big (!) and too hard to build.

Binary downloads are available from ICU, but only for msvc10 and not as debug versions, which causes issues.

Work was started by Kai to determine what data resources were required and not required, but no results have been published.

Work is required to determine whether sufficient host API functionality is available for the system locale, and whether locale resource bundles can be opened for custom locales.

For localization it seems unlikely the native API will provide a sufficient solution so we need to make the build easier and the data smaller, or accept a lesser feature set for apps that don't wish to ship ICU. This may not be an issue for most apps that don't require the more advanced locale features or custom locales. Those that do need the features will be more willing to make the effort.

Mac

ICU ships as standard on OSX and iOS, with the official API classes effectively thin wrappers around ICU. The shipped versions of ICU tend to be rather old, and the headers are not included, to discourage direct use of ICU. Currently, using ICU on OSX requires installing MacPorts and linking to its version of ICU, but this is a bad solution as it is neither portable nor distributable (it also causes build problems if the MacPorts Qt4 is installed). It is possible, with some effort, to download the headers from opensource.apple.com and use those instead. However, apps are rejected from the App Store if they link directly to ICU, which effectively rules out using the system ICU for anything on iOS, and thus, for simplicity, on OSX too. It is not clear whether shipping a self-built copy of ICU is acceptable to the App Store, but the extra 20MB added to the download is unlikely to be acceptable to iOS developers, so it would need to be trimmed down.

ICU provides a 64-bit binary download of 11MB.

For localization the native API will probably be sufficient. Code tables require investigation. This leaves the issue of QtWebKit on iOS: we need to find out what other WebKit users do, but they likely use the iOS version of WebKit.

Linux

ICU is available on all distros and is likely to always be installed because other projects depend on it, so availability and download size are not a problem. Distros are used to the issues involved with using ICU, so it would not be unreasonable to make it a requirement to rebuild Qt whenever a major ICU update is made. In fact, this seems to be existing policy for other packages using ICU.

Android

Android ships ICU. Needs more investigation on how it is used and if there are any problems with linking to it directly instead of the native API.

BB10 / QNX

QNX ships ICU. Needs more investigation on how it is used and if there are any problems with linking to it directly instead of the native API.

Possible solutions

It seems clear we cannot rely on the system ICU on Mac, and there is no system install on Windows, which swings the platform balance to 2 to 1 on devs having to build and ship ICU themselves. While forcing a self-build is extra work for devs, it does have the benefit of allowing us to use the C++ api.

Practically, there are three options depending on what features we choose to use from ICU:
1) Only use ICU on Linux for the host system localization, use the host API on Mac and Windows, and don't provide any advanced features that are not common to all three APIs. The same would apply to Code Tables.
2) Keep ICU optional on Win/Mac, use Win/Mac host system api wherever possible and only require ICU for optional advanced features when devs will be motivated to build and ship ICU.
3) Make ICU a hard requirement to be built and shipped by all devs on Mac and Windows, and require Linux devs to either build and ship themselves or use the system install and always rebuild Qt on system ICU upgrades.

Choosing 1) denies advanced features and doesn't solve the QtWebKit issue. In either of 2) or 3) devs will be faced with the need to build and ship ICU themselves and we need to make this easy and lightweight for them. Shipping all of ICU inside qtbase/3rdparty with a default config is not desirable, but nor can we expect all devs to suddenly become experts on building ICU. We should provide a simple build script that configures and enables those features in ICU that Qt uses, and allow the devs to choose what locales and code tables to ship.

One advantage of requiring our own copy of ICU is we can set a minimum version that has all the features we want to use on all platforms, and can use the C++ API.

Localization

The QTimeZone code provides a template for the solution. Keep the concept of a default system locale that uses the host facilities directly, but allow the build to choose at compile time to use ICU instead. This means maintaining more code, but it seems to be the only practical solution.

On Linux: Use ICU for system and custom locale.
On Mac: Use standard API for system and custom locale.
On Windows: Work is required to determine whether sufficient functionality is available for the system locale, and whether locale resource bundles can be opened for custom locales; otherwise ICU will have to be used.
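The compile-time choice described above could look roughly like the following. This is only a sketch under assumptions: `LocaleBackend`, `HostLocaleBackend`, `IcuLocaleBackend` and the `SKETCH_USE_ICU` macro are hypothetical names, not existing Qt classes or configure flags, and the back-ends are placeholders that do no real localization work.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical private back-end interface; not an actual Qt class.
class LocaleBackend {
public:
    virtual ~LocaleBackend() = default;
    virtual std::string name() const = 0;
};

// Placeholder for a back-end that would call the native host API.
class HostLocaleBackend : public LocaleBackend {
public:
    std::string name() const override { return "host"; }
};

// Placeholder for a back-end that would call into ICU.
class IcuLocaleBackend : public LocaleBackend {
public:
    std::string name() const override { return "icu"; }
};

// Hypothetical configure-time switch. The default follows the per-platform
// list above: ICU on Linux, host API elsewhere (Windows is still undecided).
#ifndef SKETCH_USE_ICU
#  if defined(__linux__)
#    define SKETCH_USE_ICU 1
#  else
#    define SKETCH_USE_ICU 0
#  endif
#endif

// Creates the back-end for the default system locale, chosen at compile time.
inline std::unique_ptr<LocaleBackend> createSystemLocaleBackend()
{
#if SKETCH_USE_ICU
    std::unique_ptr<LocaleBackend> backend(new IcuLocaleBackend);
#else
    std::unique_ptr<LocaleBackend> backend(new HostLocaleBackend);
#endif
    assert(backend);
    return backend;
}
```

Public classes would only ever talk to the interface, so the extra maintenance cost mentioned above is confined to the per-platform back-end implementations.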

Data size / build

A number of options are available to reduce the size of both the library and the data, by not building some features and reducing the data shipped for those features that are enabled.

The data can be built in 2 ways:
1) The default is built as a shared data library that is linked to and loaded alongside the main ICU library. This library is generated from a copy of the data files in the source tree. This is the fastest option but means the library must be updated if the data is to be updated, and is not portable across platforms.
2) The shared data library is built as a stub and the data is loaded from a .dat file located in a defined directory. The .dat file is specific to a given major and minor release, but maintains BC for point releases and is portable across platforms that have the same endianness.
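To illustrate why a .dat file is tied to both a release and an endianness, the sketch below composes the expected file name from those two inputs. It assumes the usual ICU naming pattern of `icudt`, then the short version string, then `l` or `b` for little- or big-endian ASCII platforms; that pattern and the helper names are assumptions for illustration and should be checked against the ICU packaging documentation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Detects the byte order of the machine this code runs on.
inline bool hostIsBigEndian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char firstByte = 0;
    std::memcpy(&firstByte, &probe, 1);
    return firstByte == 0x01;
}

// Builds the assumed data file name, e.g. version "51" on a little-endian
// host gives "icudt51l.dat". EBCDIC platforms are not covered by this sketch.
inline std::string icuDataFileName(const std::string &versionShort, bool bigEndian)
{
    assert(!versionShort.empty());
    return "icudt" + versionShort + (bigEndian ? "b" : "l") + ".dat";
}

// Convenience overload for the current host.
inline std::string icuDataFileNameForHost(const std::string &versionShort)
{
    return icuDataFileName(versionShort, hostIsBigEndian());
}
```

Two platforms can share one file only if both inputs match, which is the portability constraint described in option 2).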

Data is mmapped so memory usage is not affected by how much data is shipped.

Features can be disabled at build time by either editing the uconfig.h, uversion.h and utypes.h files, or more practically by passing -D flags.

Data resources that are not required can be removed to reduce the download size. This is done at build time by either directly modifying the original .mk files or, more practically, by saving the modified options in new reslocal.mk files, which the ICU build system will then use to override the original settings. Another option is to use the online data customiser to build a custom .dat file, but this is a manual, interactive process that is not easily automated and may be prone to error.

Most of the data is the mapping conversion tables, so removing these will have the greatest effect. ICU notes "ICU provides full internationalization functionality without any conversion table data. The common library contains code to handle several important encodings algorithmically: US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., US-ASCII, ISO-8859-1, and all Unicode charsets; see source/data/mappings/convrtrs.txt for the current list)." As such, even if Qt uses ICU for conversions, we may not need any of the conversion tables.
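As an illustration of what "algorithmically" means here, the sketch below converts ISO-8859-1 to UTF-8 with nothing but bit arithmetic, because every ISO-8859-1 byte value is also its Unicode code point. It is a stand-alone example, not ICU's converter code, and `latin1ToUtf8` is a name made up for this page.

```cpp
#include <cassert>
#include <string>

// Converts ISO-8859-1 bytes to UTF-8 without any lookup table.
inline std::string latin1ToUtf8(const std::string &latin1)
{
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // US-ASCII range: encoded as a single identical byte.
            utf8.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx followed by 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    // Sanity check: UTF-8 output is never shorter than the Latin-1 input.
    assert(utf8.size() >= latin1.size());
    return utf8;
}
```

Encodings such as legacy East Asian code pages have no such arithmetic relationship to Unicode, which is why they need the large mapping tables and why dropping those tables saves the most space.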

Locale data takes a small proportion, but may also be reduced by removing uncommon locales, or allowing devs to choose which they want.

Collation data takes up a significant amount of space. Removing the Asian collation files would greatly reduce this but is probably undesirable. Another option is to remove the tailoring rule strings from which the data is built, as these are rarely used at runtime.

Proposal:
1) Include a new build script in qtbase/3rdparty/icu
2) Script is run as part of configure depending on platform and flags
3) Script downloads the src tarball recommended for a given version of Qt
4) Script defaults to building only those features used by Qt on a given platform and removes data resources that are not needed by most clients.
5) Script can either be manually modified to include or exclude more features and data, or can interactively ask during the configure step.
6) Script writes modified data options to icu/src/data/*/reslocal.mk build files which override the main *.mk files
7) Script runs configure with build-time options required
8) Build happens as part of normal Qt build.

Suggested flags from ICU readme:
U_USING_ICU_NAMESPACE=0
U_CHARSET_IS_UTF8=1 – On UTF8 platforms
UNISTR_FROM_CHAR_EXPLICIT=explicit
UNISTR_FROM_STRING_EXPLICIT=explicit
U_NO_DEFAULT_INCLUDE_UTF_HEADERS=1
U_HIDE_DRAFT_API
U_HIDE_INTERNAL_API
U_HIDE_SYSTEM_API
--with-library-suffix – Add Qt as a suffix to name
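The two `*_EXPLICIT=explicit` flags work by substituting the `explicit` keyword into constructor declarations. The following sketch shows the mechanism on a made-up class; `SketchString` and `SKETCH_FROM_CHAR_EXPLICIT` are hypothetical names modelled on the flag, not ICU code.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical macro modelled on UNISTR_FROM_CHAR_EXPLICIT. Defining it as
// empty would allow implicit conversion; "explicit" forbids it.
#ifndef SKETCH_FROM_CHAR_EXPLICIT
#define SKETCH_FROM_CHAR_EXPLICIT explicit
#endif

// Made-up string class standing in for a library string type.
class SketchString {
public:
    SKETCH_FROM_CHAR_EXPLICIT SketchString(const char *text)
    {
        assert(text != nullptr);
        m_text = text;
    }
    const std::string &text() const { return m_text; }

private:
    std::string m_text;
};

// True when a const char * would silently convert to SketchString.
inline bool implicitFromCharAllowed()
{
    return std::is_convertible<const char *, SketchString>::value;
}
```

With the macro set to `explicit`, code such as passing a string literal where a `SketchString` is expected no longer compiles, so accidental conversions from `char *` are caught at build time rather than producing surprising results at runtime.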

Other options available:
--with-data-packaging=archive – To use .dat file instead
--enable-static --disable-shared – For static builds

QtWebKit

Two options:
1) Continue using ICU, use new QtCore built copy of ICU, assumes iOS will accept shipping extra copy of ICU.
2) Write new platform back-end using new Qt classes for locale and text layout, but this is a lot of work.

Option 1) is the only practical solution until such time as QtCore can provide all the required functions. The locale functions will come as a result of the new QLocale ICU backend and wrapper classes, and, as the design matches ICU closely, they should be straightforward to implement. The difficulty of the text layout and encoding back-ends is an open question.

ICU Documentation

http://site.icu-project.org/charts/charset
http://site.icu-project.org/charts/icu4c-footprint
http://userguide.icu-project.org/packaging
http://userguide.icu-project.org/design
http://userguide.icu-project.org/design#TOC-ICU-Binary-Compatibility:-Using-ICU-as-an-Operating-System-Level-Library
http://userguide.icu-project.org/icudata
http://apps.icu-project.org/datacustom/
http://www.icu-project.org/docs/demo/datacustom_help.html
http://source.icu-project.org/repos/icu/icu/trunk/readme.html