Regexp engine in Qt5

From Qt Wiki
Revision as of 09:26, 24 February 2015 by Maintenance script (talk | contribs)
Jump to navigation Jump to search


[toc align_right="yes" depth="3"]

Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.nokia.com/browse/QTBUG-20888

Current issues with QRegExp

From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html

High level issues

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern "." returns 3, not 0
    T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries


h2. Feature matrix
|. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2|
|. General comments|See above.| | | | | | |
|
. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
|. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |very fast, use a DFA|
|
. Cons | |Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |uses UTF-8, doesn't have the lookbehind neither lookahead |
|. Fixes T1 | |✔| | ✔| | |✔ |
|
. Fixes T2 | |✔| | ✔| | |✔ |
|. Fixes T3 | |✔| |✔| | |? |
|
. Fixes T4 | |✔| |✘| | | ✘|
|. Fixes T5 | |?| |?| | |✘ |
|
. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see "UText":http://userguide.icu-project.org/strings/utext .| | |✔ see StringPiece |
|. Fixes T7 | |✔| |✔| | | ✘|
✘✔



h3. Supported syntax: Characters
|. |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2|
|
. |BELL|✔|✔| |✔| | | |
|. |beginning of input| |✔| |✔| | | |
|
. inside a [set]|BACKSPACE|?|✔| | | | | |
|. outside a [set]|on a word boundary|✔|✔| | | | | |
|
. |not on a word boundary|✔|✔| |✔| | | |
|. |ASCII control character X|✘|✔| |✔| | | |
|
. |digit|✔|✔| |✔| | | |
|. |non digit|✔|✔| |✔| | | |
|
. |ESCAPE|✘|✔| |✔| | | |
|. |end of … quoting|✘|✔| |✔| | | |
|
. |FORM FEED|✔|✔| |✔| | | |
|. |end of previous match|✘|✔| |✔| | | |
|
. |LINE FEED|✔|✔| | | | | |
|. {x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | |
|
. {x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | |
|. {x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | |
|
. |start of … quoting|✘|✔| |✔| | | |
|. |CARRIAGE RETURN|✔|✔| |✔| | | |
|
. |white space|✔|✔| |✔| | | |
|. |non white space|✔|✔| |✔| | | |
|
. |HORIZONTAL TAB|✔|✔| |✔| | | |
|. |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | |
|
. |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | |
|. VERTICAL TAB|✔|✔| |✘| | | |
|
. |word character|✔|✔| |✔| | | |
|. |non word character|✔|✔| |✔| | | |
|
. {hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | |
|. |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | |
|
. |grapheme cluster|✘|✘| | | | | |
|. |end of input (or before the final )|✘|✔| |✔| | | |
|
. |end of input|✘|✔| |✔| | | |
|. |n-th backreference|✔|✔| | | | | |
|
. ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | |
|. .|any character but newlines|✔|✔| |✔| | | |
|
. ^|line beginning|✔|✔| |✔| | | |
|. $|line end|✔|✔| |✔| | | |
|
. quote the following symbol|✔|✔| |✔| | | |
|. [pattern]|set|✔|✔| |✔| | | |


h3. Supported syntax: Operators
|. Operator |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2|
|
. * |match 0 or more times|✔|✔| |✔|?|?|✔|
|. + |match 1 or more times|✔|✔| |✔| | |✔|
|
. ? |match 0 or 1 times|✔|✔| |✔| | |✔|
|. {n} |match n times|✔|✔| |✔| | |✔|
|
. {n,} |match n or more times|✔|✔| |✔| | |✔|
|. {n,m} |match between n and m times|✔|✔| |✔| | |✔|
|
. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔|
|. ? |match 1 or more times, not greedy|✘|✔| |✔| | |✔|
|. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔|
|
. {n}? |match n times|✘|✔| |✔| | |✔|
|. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔|
|
. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔|
|. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘|
|
.+ |match 1 or more times, possessive|✘|✔| |✔| | |✘|
|
. ? |match 0 or 1 times, possessive|✘|✔| |✔| | |✘|
|
. {n}+ |match n times|✘|✔| |✔| | |✘|
|. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘|
|
. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘|
|. ( … ) |capturing group|✔|✔| |✔| | |✔|
|
. (?: … ) |group|✔|✔| |✔| | |✔|
|. (?> … ) |atomic grouping|✘|✔| |✔| | |✘|
|
. (?# … ) |comment|✘|✔| |✔| | |✘|
|. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘|
|
. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘|
|. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘|
|
. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘|
|. (?flags: … ) |flags change|✘|✔| | | | |✔|
|
. (?flags) |flags change|✘|✔| |✔| | |✔|
|. (?P&lt;name&gt; …) |named capturing group|✘|✔| |✘| | |✔|
|
. (?<name&gt; …) |named capturing group|✘|✔| |✘| | |✘|
|. (?'name' …) |named capturing group|✘|✔| |✘| | |✘|
|
. (?&#x7c; …) |branch reset|✘|✔| |✘| | |✘|
|. | | | | | | | | |
|
. | | | | | | | | |


h3. Supported syntax: flags
|. Flag |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2|
|
. /i |case insensitive|✔|✔| |✔| | |✔|
|. /m |multi-line|✘|✔| |✔| | |✔|
|
. /s |dot matches anything|~[1] |✔| |✔| | |✔|
|. /x |ignore whitespace and comments|✘|✔| |✔| | |✔|
|_. /U |minimal match|✔|✔| |✘| | |✔|

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.