Qt-contributors-summit-2013-Qt ICU

From Qt Wiki
Revision as of 17:41, 12 March 2015 by AutoSpider (talk | contribs) (Decode HTML entity names)

WIP

Agenda

  • Review decision on using ICU everywhere in Qt
  • Discuss QTimeZone integration into QDateTime: initialization, validation and daylight saving transitions

Minutes

Taken by David Faure

QSystemLocale picks up the user settings (bypassing ICU, which doesn’t do that).
So it needs platform backends to do that.

Same for the user-selected timezone, etc.
=> ICU is really just the database.

We need to complete that code, to get all settings on all OSes.

ICU data is 26 MB on disk (on Windows). The bulk is timezones. Lots of translations. 4.6 MB with just the English locale and language.

ICU4C lib = 5MB on top.

Idea from Thiago:

  • QLocale: basic, based on host APIs
  • Something more complete based on ICU, in a separate module
  • Could be done with QCollator, calendaring and other future stuff, but not very consistent

Idea from Lars: a stripped ICU. Translations would come from Qt po files.

Another option: load ICU dynamically, and use system apis if it’s not available.
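One complication with the dynamic loading option is that ICU versions both its library names and its exported C symbols, so a loader has to probe for them. A minimal sketch of generating the names to try follows. The helper names are hypothetical, the soname scheme assumes Linux, and the symbol suffix scheme assumes ICU 49 or later, where the suffix is the major version.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: shared-library names to try for ICU's common library,
// newest version first. Assumes the Linux soname scheme "libicuuc.so.<major>".
std::vector<std::string> icuLibraryCandidates(int newestMajor, int oldestMajor)
{
    std::vector<std::string> names;
    for (int v = newestMajor; v >= oldestMajor; --v)
        names.push_back("libicuuc.so." + std::to_string(v));
    return names;
}

// Hypothetical helper: ICU exports its C functions under versioned names.
// Assumes ICU 49 or later, where the suffix is "_<major>" (e.g. u_strToUpper_50).
std::string icuVersionedSymbol(const std::string &function, int major)
{
    return function + "_" + std::to_string(major);
}
```

Each candidate name would be handed to the platform loader (dlopen, or QLibrary inside Qt) in turn. If none loads, or a required symbol cannot be resolved, the code would fall back to the system APIs.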

Dropped idea: a plugin (loaded with native APIs, no QString), which would also exist in minimal, small, and full versions. Creates a deployment problem.

Conclusions:

Mac: we’ll use the Mac API wrapper

Android:

  • TODO: check if we can use the ICU data files from ICU4J (same as the ones from ICU4C)
  • TODO: check if we can use that data, even if the lib isn’t available. It should be the same version of ICU though!
  • Or we could use JNI... (everyone is horrified by the idea though).

Idea from Lars: an ICU-compatible replacement API, that can be shipped instead of ICU to save size.
QtCore uses 26 symbols from ICU.
So the user could choose which ICU to use: 1) minimal (just C), 2) small (C and one selected locale?), 3) full.
Whenever using QtWebKit, the full one is required. It’s really mutually exclusive (WebKit uses 85 symbols from ICU).

ICU and Qt

At QtCS 2012 it was decided to migrate to using ICU for all localization services to reduce our own code and data, and possibly for code conversion tables, effectively making it a hard requirement for Qt5. Since then a number of technical and political objections have been raised which need to be addressed. This session is to discuss the problems and try to come up with a solution.

Primary issues:

  • ICU ignores host data and any user custom settings, so may not appear ‘native’
  • BC issues with using system libraries limit us to using the C api which lacks features we may need
  • Mac doesn’t ship system library headers, App Store bans direct linking to libicu, out-dated system versions
  • Complaints from Windows devs that ICU download is too big to ship and has no debug version, but too hard to build / shrink the build themselves
  • Android doesn’t guarantee ICU4C will be installed(?), but offers no other NDK localization option(?)
  • Tizen also doesn’t ship ICU headers or allow linking to system ICU in its app store
  • QtWebKit has hard requirement that is not easily removed

This essentially means we must either use only native api on Win and Mac, or ship our own minimal version of ICU.

A default ICU 50.1 build with full data is 26.5 MB on disk or 10.5 MB zipped. Reduced to the minimum library and only the English locale, translation and mapping data the build is 10 MB on disk or 4.6 MB zipped. Further optimizations possible. An app choosing to ship ICU with a reasonable number of supported language translations can expect a build of about 12 MB on disk or 6 MB zipped. The big decision is whether this is an acceptable size for downloads on Win, Mac and Android. (see below).

Practically, there are three options depending on what features we need to use from ICU:
1) Only use the host api on Mac and Windows, use ICU as the host api on Linux, and don’t provide any features that are not common to all three api’s.
2) Use Win/Mac host system api wherever possible, only require ICU for optional features where devs will be motivated to build and ship ICU.
3) Make ICU a hard requirement, always to be built and shipped by all devs on Mac and Windows.

Major questions:

  • Are host system api sufficient for requirements? Mac yes for localization as thin wrapper to ICU, WinRT appears to be modelled on ICU, Win32 needs research but seems doubtful. Not clear if Windows allows opening resource files for non-system/custom locales.
  • Android / BB10 / QNX / Tizen details?

Side note: The Chromium/Blink project has debated these same issues and decided to always build and ship their own version of ICU on all platforms including Linux and Android, wrapped in a thin abstraction layer.

It is proposed to follow option 2) until such time as we need features only ICU or the C++ api can provide. An extension of option 3) is to also force our own build on Linux which would allow us to use the C++ api.

  • New build script in qtbase/3rdparty to checkout, configure and build minimal required ICU as part of qtbase standard build system
  • Build script able to be customized to choose what locales and translations to ship
  • Localization to always use host api for system locale, any features needing ICU must be optional build-time flag
  • Script detects if ICU features are needed, Win/Mac/iOS/Android build and link own copy of libicu, Linux/QNX defaults to system library but can choose own copy if needed
  • Only use ICU C api for now, but require a recent enough version on all platforms to be useful, if not available from system then must build own
  • Minimal non-ICU C locale for embedded if required?

QTimeZone / QDateTime integration issues

The current implementation of QDateTime with the system time zone has a number of implementation issues around initialization, validity checking, and maths that will also affect QTimeZone so should have their behaviour defined or fixed before QTimeZone is integrated.

  • QTimeZone uses a lazy initialization that accepts any date, time and spec, validity is only checked when used
  • isValid() does not take the time zone into account, only if QDate and QTime are individually valid
  • Date-only math functions (add day/month/year) only done in QDate, i.e. validity check and maths applied is on date only and doesn’t consider time and time zone, nor whether the result is valid in the tz.
  • Time math functions do use QDateTime::isValid() and convert to UTC to calculate
  • Changing spec (i.e. to UTC in calculations) calls mktime to calculate and validate
  • The transition from Standard Time to Daylight Time (e.g. 2am becomes 3am) leaves a ‘hole’ of 1 hour that should be considered invalid but isValid() returns true and the date and time maths functions are still applied

It has been proposed to re-write QDateTime to internally store as an absolute msecs since epoch which would inherently solve many of these issues, but this would radically change the behaviour of QDateTime which currently treats the ymd/hms values as “fixed” and the time spec is used to interpret that value. This would mostly affect the default SystemTime spec where the system time zone can change underneath QDateTime causing the absolute UTC value to change. A considerable re-write would be required to keep the behaviour consistent. There is also the behaviour that you can store an invalid date but valid time in QDateTime, and later fix the date to be valid, which may not be possible if using a single qint64. Other likely effects would be:

  • Creation would be slower as has to validate and convert
  • Accessing most commonly used functions would be slower, e.g. calling dt.date().day() or dt.time().hour()
  • Maths and conversion functions would be faster and simpler and more accurate
  • Memory footprint would be reduced by one-third
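To make the trade-off concrete, here is a minimal sketch of the conversion that the msecs-since-epoch storage would need at creation time. It uses standard proleptic Gregorian day arithmetic. The function names are hypothetical and not part of Qt, and the offset from UTC is assumed to be already known, which is exactly the part that needs a time zone lookup in practice.

```cpp
#include <cstdint>

// Days between 1970-01-01 and the given proleptic Gregorian date.
// month is 1..12, day is 1..31; the year is shifted so that it starts in March.
std::int64_t daysFromEpoch(int year, int month, int day)
{
    const std::int64_t y = static_cast<std::int64_t>(year) - (month <= 2 ? 1 : 0);
    const std::int64_t era = (y >= 0 ? y : y - 399) / 400;          // 400-year cycle
    const std::int64_t yoe = y - era * 400;                          // year of era, 0..399
    const std::int64_t mp = month > 2 ? month - 3 : month + 9;       // March = 0
    const std::int64_t doy = (153 * mp + 2) / 5 + day - 1;           // day of shifted year
    const std::int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;  // day of era
    return era * 146097 + doe - 719468;
}

// Hypothetical helper: collapse ymd/hms fields plus a known UTC offset (in seconds)
// into one absolute value, as the proposed storage would do on construction.
std::int64_t toMSecsSinceEpoch(int year, int month, int day,
                               int hour, int minute, int second, int msec,
                               int offsetFromUtcSecs)
{
    const std::int64_t localSecs = daysFromEpoch(year, month, day) * 86400
                                 + hour * 3600 + minute * 60 + second;
    return (localSecs - offsetFromUtcSecs) * 1000 + msec;
}
```

Once the value is stored this way, adding seconds or comparing two values is plain integer arithmetic, while reading back day() or hour() needs the reverse conversion. That matches the expected costs listed above.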

If we keep storage in QDate/QTime format we have to ensure validity is checked properly. This solution is a lot less code change and keeps current behaviour, but

  • QDateTime::isValid() must check if valid in tz, i.e. call mktime the first time called then cache result in QDateTimePrivate::Spec
  • All date-only maths needs to be converted to UTC first then converted back, same as for time
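A rough sketch of the "call mktime once, then cache" idea follows. The LocalDateTime struct and its Status enum are hypothetical stand-ins for QDateTime's private data, not Qt API. The check relies on mktime normalising out-of-range or non-existent fields in place, which is how glibc behaves; other platforms may differ.

```cpp
#include <ctime>

// Hypothetical stand-in for the stored date/time fields plus a cached validity state.
struct LocalDateTime {
    enum Status { Unchecked, Valid, Invalid };

    int year, month, day, hour, minute, second;
    mutable Status status;

    LocalDateTime(int y, int mo, int d, int h, int mi, int s)
        : year(y), month(mo), day(d), hour(h), minute(mi), second(s), status(Unchecked) {}

    // Valid only if the fields survive a round trip through the system time zone.
    bool isValid() const
    {
        if (status == Unchecked) {
            std::tm probe = std::tm();
            probe.tm_year = year - 1900;
            probe.tm_mon = month - 1;
            probe.tm_mday = day;
            probe.tm_hour = hour;
            probe.tm_min = minute;
            probe.tm_sec = second;
            probe.tm_isdst = -1;   // let the C library work out whether DST applies
            probe.tm_wday = -1;    // mktime sets this on success, so -1 afterwards means failure
            std::mktime(&probe);   // normalises the fields in place
            const bool unchanged = probe.tm_wday != -1
                && probe.tm_year == year - 1900 && probe.tm_mon == month - 1
                && probe.tm_mday == day && probe.tm_hour == hour
                && probe.tm_min == minute && probe.tm_sec == second;
            status = unchanged ? Valid : Invalid;   // cached for later calls
        }
        return status == Valid;
    }
};
```

For example, LocalDateTime(2013, 2, 30, 12, 0, 0).isValid() is false because mktime normalises 30 February to 2 March. A time inside a daylight saving 'hole' is expected to be rejected the same way wherever mktime shifts it out of the hole. The cache would have to be reset whenever the fields or the system time zone change.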

A further major issue is the Second Occurrence:

  • The transition from Daylight Time to Standard Time (e.g. 3am becomes 2am) has an hour that ‘occurs’ twice and is thus ambiguous, e.g. 2:30am. We currently have no api to indicate or set which occurrence.
  • Windows, Linux and Mac implementations of mktime have different assumptions for ambiguous times: Linux assumes first occurrence, Windows assumes second occurrence, Mac assumes first for about the first 40 minutes and second for the last 20 minutes (probably a bug).
  • The mktime tm_isdst flag is not sufficient for certain scenarios
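To illustrate what an occurrence api has to decide, here is a minimal sketch that does not use mktime at all. It classifies a local time against a single known transition and resolves ambiguous times with an explicit first/second choice. All names (Transition, LocalTimeKind, Occurrence, classify, toUtcSecs) are hypothetical, and a real implementation would look the transition up in the time zone data.

```cpp
#include <cstdint>

enum class LocalTimeKind { Unique, Gap, Ambiguous };
enum class Occurrence { First, Second };

// Hypothetical description of one offset change: at UTC instant atUtcSecs the
// offset from UTC changes from offsetBefore to offsetAfter (both in seconds).
struct Transition {
    std::int64_t atUtcSecs;
    int offsetBefore;
    int offsetAfter;
};

// localSecs is the wall-clock time counted in seconds as if it were UTC.
LocalTimeKind classify(std::int64_t localSecs, const Transition &tr)
{
    const bool fitsBefore = localSecs - tr.offsetBefore < tr.atUtcSecs;
    const bool fitsAfter = localSecs - tr.offsetAfter >= tr.atUtcSecs;
    if (fitsBefore && fitsAfter)
        return LocalTimeKind::Ambiguous;   // the hour that occurs twice
    if (!fitsBefore && !fitsAfter)
        return LocalTimeKind::Gap;         // the hour that never occurs
    return LocalTimeKind::Unique;
}

// Returns false for times in the gap; otherwise writes the UTC value to utcSecs.
bool toUtcSecs(std::int64_t localSecs, const Transition &tr, Occurrence which,
               std::int64_t &utcSecs)
{
    switch (classify(localSecs, tr)) {
    case LocalTimeKind::Gap:
        return false;
    case LocalTimeKind::Ambiguous:
        utcSecs = localSecs - (which == Occurrence::First ? tr.offsetBefore : tr.offsetAfter);
        return true;
    case LocalTimeKind::Unique:
        break;
    }
    const bool before = localSecs - tr.offsetBefore < tr.atUtcSecs;
    utcSecs = localSecs - (before ? tr.offsetBefore : tr.offsetAfter);
    return true;
}
```

With a Daylight to Standard change at 01:00 UTC (offsets +2h then +1h), local 02:30 is Ambiguous: the first occurrence maps to 00:30 UTC and the second to 01:30 UTC. With the opposite change (offsets +1h then +2h), local 02:30 falls in the Gap.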

The current time zone patches have api for setting and reading the occurrence, but issues with mktime have prevented working code so far.

Development plan:

  • Re-submit patches for offsetFromUtc and cleaning up format/parse code
  • Implement chosen solution for QDateTime internals, other clean-ups
  • Implement second occurrence support for SystemTime only
  • Re-submit QTimeZone patches

Detailed notes on ICU in Qt5

Current use of ICU in Qt5

  • QtWebKit – required on all platforms for localization and text layout.
  • QtCore / QIcuCodec – private, optional
  • QtCore / QCollator – private, optional
  • QtCore / QLocale for toUpper() and toLower() – private, optional
  • sqlite3?

Issues with ICU

Common

ICU is notoriously bad at building libraries that can be reliably linked against: it offers no BC guarantees for C++, the .so names change, the .dat files are tightly coupled to the library version, etc. For compatibility this leaves us to use the C api which lacks many of the advanced features in the C++ api that make using it desirable. C API additions to match can be requested and some have been added as a result, but old system versions shipped in OSX and Linux won’t have these available. To use the C++ api would require strictly controlling the version of ICU linked to, i.e. building it ourselves.

ICU respects the user’s locale code (e.g. en_GB) but doesn’t use the host system data or the user’s custom settings (e.g. different date format), so ICU apps may not fully fit in with a user’s environment. This may especially be a problem on Windows where settings and behaviour are very different from the POSIX world.

The default ICU data is fairly large, up to 21.3 MB on disk or 8.5 MB zipped, but not all data is actually required, such as code tables or translations for languages not supported by an app, and could be easily shrunk by apps to about 7MB on disk or 3.5 MB zipped.

Windows

See http://thread.gmane.org/gmane.comp.lib.qt.devel/9226

ICU is not shipped with Windows so all apps need to build and distribute their own copy of ICU including data. Devs have complained that ICU is too big (!) and too hard to build.

A binary download of 11.4 MB is available from ICU, but only for MSVC10 and without debug versions, which causes issues.

Work required to determine if sufficient host api functionality available for system locale, and if locale resource bundles can be opened to use for custom locales.

For localization it seems unlikely the native API will provide a sufficient solution (especially on XP) so we need to make the build easier and the data smaller, or accept a lesser feature set for apps that don’t wish to ship ICU. This may not be an issue for most apps that don’t require the more advanced locale features or custom locales. Those that do need the features will be more willing to make the effort.

Mac

ICU ships as standard on OSX and iOS, with the official API classes effectively thin wrappers around ICU that allow for user customisations. Shipped versions of ICU tend to be rather old, and the headers are not included to discourage the direct use of ICU. Currently Qt5 requires installing MacPorts and linking to their version of ICU, but this is a bad solution as it is not portable or distributable (it also causes build problems if MacPorts Qt4 is also installed). It is possible to download the headers from opensource.apple.com with some effort and use those instead (WebKit, Chrome and others do this by including a copy of the headers). However apps are rejected from the App Store if they directly link to ICU which effectively rules out using the system ICU for anything on iOS and thus for simplicity on OSX too. It is not clear if shipping a self-built copy of ICU is acceptable to the App Store, but the extra 20 MB added to the download is not likely to be acceptable to iOS developers so would need to be trimmed down.

ICU provides a 64-bit binary download of 11 MB.

For localization the native API will probably be sufficient. Code tables require investigation.

QtWebKit is not permitted on iOS by the App Store rules, apps must use the native WebKit install. Therefore iOS does not need to be considered in any QtWebKit solution.

Linux

ICU is available on all distros and likely to always be installed due to other projects depending on it, so availability and download size are not a problem. Distros are used to the issues involved with using ICU so it might not be unreasonable to make it a requirement to rebuild Qt whenever a major ICU update is made. In fact, this seems to be existing policy for other packages using ICU.

Android

Android ships ICU? Or just the Java version by default? Needs more investigation on how it is used and if there are any problems with linking to it directly instead of the native API.

ICU4C is not a standard part of Android, but is supported in Android External and builds fine. Appears to be no native C/C++ api to use in NDK, only Java native api. Not yet clear if should use Android src repo or ICU master repo. Appears we will need to build and ship our own copy.

Android repo at https://android.googlesource.com/platform/external/icu4c but doc at http://source.android.com/source/submit-patches.html#icu4c makes it clear upstream is considered the master.

See https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/eSiHBND2rAQ/X9LyMHImNjgJ for an interesting discussion on using ICU in WebKit / Blink / Android which seems to suggest ICU system will become standard install?

BB10 / QNX

QNX ships ICU. Needs more investigation on how it is used and if there are any problems with linking to it directly instead of the native API.

BB10 ships ICU in firmware to use for NDK localization and Qt4 in firmware to use for gui’s (and thus uses old QLocale). Qt5 can be built and used, but is not yet a standard install in firmware. This means Qt5 can use the system ICU as the locale back-end, i.e. same as Linux. Initially will only be able to use the C api, but once both are in firmware it will always be a monolithic system build so can probably use the C++ api.

QtWebKit

WebKit provides a localization, text layout and string encoding abstraction layer. WebKit ports such as QtWebKit can provide their own backend but most choose to use the existing ICU back-end for convenience. QtWebKit4 apparently used to use QLocale/QString but switched to ICU for Qt5? This means QtWebKit needs and uses ICU on all platforms and so may not always properly fit in, whereas other system-provided WebKit ports may actually use the native api and so fit in better. Need to determine exactly what QtWebKit uses from ICU and whether current approach is still best. Also an issue that by using ICU for localization may get different results than QLocale may provide for the rest of the app.

WebKit / QtWebKit has 4 copies of the ICU headers included in its source tree, used to build on Mac 10.4.

WebKit / QtWebKit mostly uses the ICU C api, but does occasionally use the C++ api in port specific code.

See https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/eSiHBND2rAQ/X9LyMHImNjgJ for an interesting discussion on using ICU in WebKit / Blink.

Build appears to only link against core and i18n libraries. See bottom for detailed breakdown of ICU includes used in WebKit.

QtWebKit is not permitted on iOS by the App Store rules, apps must use the native WebKit install. Therefore iOS does not need to be considered in any QtWebKit solution.

Possible solutions

It seems clear we cannot rely on the system ICU on Mac, and there is no system install on Windows, which swings the platform balance to 2 to 1 on devs having to build and ship ICU themselves. While forcing a self-build is extra work for devs, it does have the benefit of allowing us to use the C++ api.

Practically, there are three options depending on what features we choose to use from ICU:
1) Only use ICU on Linux for the host system localization, use the host api on Mac and Windows, and don’t provide any advanced features that are not common to all three api’s. The same would apply to Code Tables.
2) Keep ICU optional on Win/Mac, use Win/Mac host system api wherever possible and only require ICU for optional advanced features when devs will be motivated to build and ship ICU.
3) Make ICU a hard requirement to be built and shipped by all devs on Mac and Windows, and require Linux devs to either build and ship themselves or use the system install and always rebuild Qt on system ICU upgrades.

Choosing 1) denies advanced features and doesn’t solve the QtWebKit issue. In either of 2) or 3) devs will be faced with the need to build and ship ICU themselves and we need to make this easy and lightweight for them. Shipping all of ICU inside qtbase/3rdparty with a default config is not desirable, but nor can we expect all devs to suddenly become experts on building ICU. We should provide a simple build script that configures and enables those features in ICU that Qt uses, and allow the devs to choose what locales and code tables to ship.

One advantage of requiring our own copy of ICU is we can set a minimum version that has all the features we want to use on all platforms, and can use the C++ API.

Localization

The QTimeZone code provides a template for the solution. Keep the concept of a default system locale that uses the host facilities directly, but allow the build to choose at compile time to use ICU instead. This means maintaining more code but seems the only practical solution.

On Linux: Use ICU for system and custom locale.
On Mac: Use standard API for system and custom locale.
On Windows: Work required to determine if sufficient functionality available for system locale, and if locale resource bundles can be opened for custom locales, otherwise will have to use ICU.
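A minimal sketch of the compile-time switch described above, using locale-aware upper-casing as the example feature. QT_SKETCH_USE_ICU is a hypothetical flag, not a real Qt define, and the non-ICU branch is an ASCII-only placeholder standing in for whatever host API call the platform would provide. The ICU branch uses the C api function u_strToUpper from unicode/ustring.h.

```cpp
#include <string>

// QT_SKETCH_USE_ICU is a hypothetical build-time flag used only for this sketch.
#ifdef QT_SKETCH_USE_ICU
#include <unicode/ustring.h>
#endif

// Reports which back-end this build was compiled with.
const char *localeBackendName()
{
#ifdef QT_SKETCH_USE_ICU
    return "icu";
#else
    return "host";
#endif
}

// Upper-cases UTF-16 text for the given locale name (e.g. "tr_TR").
std::u16string toUpper(const std::u16string &text, const char *localeName)
{
#ifdef QT_SKETCH_USE_ICU
    UErrorCode status = U_ZERO_ERROR;
    std::u16string out(text.size() * 3 + 1, u'\0');   // upper-casing can lengthen the text
    const int32_t length = u_strToUpper(reinterpret_cast<UChar *>(&out[0]),
                                        static_cast<int32_t>(out.size()),
                                        reinterpret_cast<const UChar *>(text.data()),
                                        static_cast<int32_t>(text.size()),
                                        localeName, &status);
    if (U_FAILURE(status))
        return text;                                  // give up and return the input unchanged
    out.resize(static_cast<std::size_t>(length));
    return out;
#else
    // Placeholder for the host API call: handles ASCII letters only.
    (void)localeName;
    std::u16string out = text;
    for (char16_t &c : out) {
        if (c >= u'a' && c <= u'z')
            c = static_cast<char16_t>(c - (u'a' - u'A'));
    }
    return out;
#endif
}
```

Callers see one function either way, so only the back-end file has to know which library is in use. The cost is the one noted above: two code paths to maintain and test.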

Data size / build

A number of options are available to reduce the size of both the library and the data, by not building some features and reducing the data shipped for those features that are enabled.

Work was started by Kai to determine what data resources were required and not required, but no results have been published.

The data can be built in 2 ways:
1) The default is built as a shared data library that is linked to and loaded alongside the main ICU library. This library is generated from a copy of the data files in the source tree. This is the fastest option but means the library must be updated if the data is to be updated, and is not portable across platforms.
2) The shared data library is built as a stub and the data is loaded from a .dat file located in a defined directory. The .dat file is specific to a given major and minor release, but maintains BC for point releases and is portable across some platforms that have the same endianness.

Data is mmapped so memory usage is not affected by how much data is shipped.

Features can be disabled at build time by either editing the uconfig.h, uversion.h and utypes.h files, or more practically by passing -D flags.

Data resources that are not required can be removed to reduce the download size. This is done at build time by either directly modifying the original .mk files, or more practically by saving the modified options in new reslocal.mk files which the ICU build system will then use to override the original settings. Another option is to manually use the online data customiser to build a custom .dat file, but this is a manual interactive process not easily automated and may be prone to human error.

Most data is the mapping conversion tables, removing these will have the greatest effect. ICU notes “ICU provides full internationalization functionality without any conversion table data. The common library contains code to handle several important encodings algorithmically: US-ASCII, ISO-8859-1, UTF-7/8/16/32, SCSU, BOCU-1, CESU-8, and IMAP-mailbox-name (i.e., US-ASCII, ISO-8859-1, and all Unicode charsets; see source/data/mappings/convrtrs.txt for the current list).” As such even if Qt uses ICU for conversions we may not need all/any of the conversion tables.

Locale data takes a small proportion, but may also be reduced by removing uncommon locales, or allowing devs to choose which they want.

Collation data uses a significant amount of data. Removing the Asian collation files would greatly reduce this but is possibly undesirable. Another option is to remove the tailoring rule strings from which the data is built which are rarely used at runtime.

Build sizes

Build                                          | Disk Size | Zipped Size
Default full build, full data                  | 26.5 MB   | 10.5 MB
Core build, full data                          | 25.9 MB   | 10.3 MB
Core build, all locales, only en translations  | 11.2 MB   | 5.0 MB
Core build, only en locale and translations    | 10.0 MB   | 4.6 MB

Core build excludes the optional I/O, Font Layout and Tool Utility libraries. The difference of 0.2 MB in zipped size means there is little to be gained from code/functionality reductions, but other flags may further reduce size.

Library                                   | Linux Filename | Linux Size
Common Library                            | libicuuc       | 1.8 MB
i18n Library                              | libicui18n     | 2.6 MB
Data Library                              | libicudata     | 21.3 MB
I/O Library (optional)                    | libicuio       | 66 KB
Font Layout Library (optional)            | libicule       | 440 KB
Font Layout Extension Library (optional)  | libiculx       | 66 KB
Tool Utility Library (optional)           | libicutu       | 203 KB

Data                             | Disk Size | Zipped Size | Zipped %
Total Data                       | 21.3 MB   | 8.5 MB      | 100%
Code Table Mappings              | 4.4 MB    | 2.3 MB      | 27%
Collation Rules                  | 3.3 MB    | 0.8 MB      | 9.5%
Language & Region Names          | 2.8 MB    | 0.9 MB      | 10.5%
Time Zone Names                  | 2.1 MB    | 0.6 MB      | 7%
Currency Names & Plurals         | 1.9 MB    | 0.6 MB      | 7%
Locale Formats                   | 1.2 MB    | 0.4 MB      | 5%
Transliteration Rules and Names  | 0.6 MB    | 0.2 MB      | 2.4%
Rule Based Number Formatting     | 0.3 MB    | 0.1 MB      | 1.2%
String Preparation (RFC’s)       | 0.2 MB    | 0.1 MB      | 1.2%
Root data?                       | 4.5 MB    | 2.5 MB      | 29.5%

Proposal:
1) Include a new build script in qtbase/3rdparty/icu
2) Script is run as part of configure depending on platform and flags
3) Script downloads the src tarball recommended for a given version of Qt
4) Script defaults to building only those features used by Qt on a given platform and removes data resources that are not needed by most clients.
5) Script can either be manually modified to include or exclude more features and data, or can interactively ask during configure step.
6) Script writes modified data options to icu/src/data/*/reslocal.mk build files which override the main *.mk files
7) Script runs configure with build-time options required
8) Build happens as part of normal Qt build.

Suggested flags from ICU readme:
U_USING_ICU_NAMESPACE=0
U_CHARSET_IS_UTF8=1 – On UTF8 platforms
UNISTR_FROM_CHAR_EXPLICIT=explicit
UNISTR_FROM_STRING_EXPLICIT=explicit
U_NO_DEFAULT_INCLUDE_UTF_HEADERS=1
U_HIDE_DRAFT_API
U_HIDE_INTERNAL_API
U_HIDE_SYSTEM_API
--with-library-suffix (add Qt as a suffix to the library name)

Other options available:
--with-data-packaging=archive (to use a .dat file instead)
--enable-static --disable-shared (for static builds)

See http://thebugfreeblog.blogspot.co.uk/2013/05/cross-building-icu-for-applications-on.html

export CPPFLAGS="-DU_USING_ICU_NAMESPACE=0 -DU_CHARSET_IS_UTF8=1 -DUNISTR_FROM_CHAR_EXPLICIT=explicit -DUNISTR_FROM_STRING_EXPLICIT=explicit -DU_NO_DEFAULT_INCLUDE_UTF_HEADERS=1"
./runConfigureICU Linux --with-library-suffix=qt --disable-draft --disable-extras --disable-icuio --disable-layout --disable-tests --disable-samples

QtWebKit

Two options:
1) Continue using ICU, use new QtCore built copy of ICU, assumes iOS will accept shipping extra copy of ICU.
2) Write new platform back-end using new Qt classes for locale and text layout, but this is a lot of work.

Option 1) is the only practical solution until such time as QtCore can provide all the required functions. The locale functions will come as a result of the new QLocale ICU backend and wrapper classes, and as the design matches ICU closely should be straightforward to implement. The difficulty of the text layout and encoding back-ends is an open question.

ICU includes in QtWebKit

ICU Backend in qtwebkit/Source/WebCore/platform/text/
LineBreakIteratorPoolICU.h
LocaleICU.h/.cpp
LocaleToScriptMappingICU.cpp
TextBreakIteratorICU.h/.cpp
TextCodecICU.h/.cpp
TextEncodingDetector.cpp

ICU Backend in Source/WTF/wtf/:
url/src/URLCanonICU.cpp
unicode/icu/UnicodeIcu.h
unicode/icu/CollatorICU.cpp
unicode/qt4/UnicodeQt4.h

All ICU includes used in QtWebKit source tree, not all are built by Qt port:

Include               | Language | Function                                         | Used in WebKit By
<unicode/locid.h>     | C++      | Locale                                           | skia, chromium
<unicode/normlzr.h>   | C++      | Normalization                                    | freetype, harfbuzz, skia
<unicode/uniset.h>    | C++      | Sets of Unicode Code Points and Strings          | chromium
<unicode/ubrk.h>      | C        | Text Boundary Analysis (Break Iteration)         | text
<unicode/uchar.h>     | C        | Unicode Character Properties and Names           | win, harfbuzz, wx, chromium, wtf
<unicode/ucnv_cb.h>   | C        | Codepage Conversion and Unicode Text Compression | text, wtf
<unicode/ucnv.h>      | C        | Codepage Conversion and Unicode Text Compression | text, wtf
<unicode/udat.h>      | C        | Date api                                         | JavaScriptCore/runtime, text
<unicode/udatpg.h>    | C        | Date Pattern Generator                           | text
<unicode/uidna.h>     | C        | International Domain Names in Applications       | WebCore/platform/KURL.cpp, wtf
<unicode/uloc.h>      | C        | Locales                                          | text
<unicode/unorm.h>     | C        | Normalization                                    | graphics/SurrogatePairAwareTextIterator.cpp, win, wx, chromium, text
<unicode/unum.h>      | C        | Number Formatting                                | text
<unicode/uscript.h>   | C        | Unicode Character Properties and Names           | chromium, wtf
<unicode/usearch.h>   | C        | String Searching                                 | WebCore/editing
<unicode/uset.h>      | C        | Sets of Unicode Code Points and Strings          | WebCore/editing
<unicode/ustring.h>   | C        | Strings and Character Iteration                  | blackberry, wtf
<unicode/utf16.h>     | C        | Strings and Character Iteration                  | blackberry, wx, linux, wtf
<unicode/utypes.h>    | C        | Basic Types and Constants                        | text

ICU Documentation

http://site.icu-project.org/charts/charset
http://site.icu-project.org/charts/icu4c-footprint
http://userguide.icu-project.org/packaging
http://userguide.icu-project.org/design
http://userguide.icu-project.org/design#TOC-ICU-Binary-Compatibility:-Using-ICU-as-an-Operating-System-Level-Library
http://userguide.icu-project.org/icudata
http://apps.icu-project.org/datacustom/
http://www.icu-project.org/docs/demo/datacustom_help.html
http://source.icu-project.org/repos/icu/icu/trunk/readme.html