Regexp engine in Qt5: Difference between revisions
No edit summary |
No edit summary |
||
Line 1: | Line 1: | ||
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]] | [[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]] | ||
[toc align_right="yes" depth="3"] | |||
= Regular expression engine in Qt5 = | = Regular expression engine in Qt5 = | ||
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf. | This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf. | ||
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html | |||
* https://bugreports.qt.nokia.com/browse/QTBUG-20888 | |||
== Current issues with QRegExp == | == Current issues with QRegExp == | ||
Line 12: | Line 15: | ||
* QRegExp API is broken (see '''T7''' in Low Level) | * QRegExp API is broken (see '''T7''' in Low Level) | ||
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification | * QRegExp is used for QtScript though it does not fullfill the ECMAScript specification "ECMA-262-1999":http://www.ecma-international.org/publications/standards/Ecma-262.htm . Missing features include | ||
** Non-greedy quantifiers (see page 141 titled | ** Non-greedy quantifiers (see page 141 titled "- 129 -") | ||
'''''' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container. | |||
* Patternist/XPath also needs Regex features not found in QRegExp, including | |||
'''''' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax ) | |||
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly "that problem":http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 … | |||
* Customer complained about QRegExp (though I don't see what's their exact problem): | * Customer complained about QRegExp (though I don't see what's their exact problem): | ||
** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.'' | ** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.'' | ||
* Public task request: | * Public task request: | ||
** Lookbehind ('''T4''') ( | ** Lookbehind ('''T4''') ("bug 217916":http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry) | ||
** Support for POSIX syntax ( | ** Support for POSIX syntax ("bug 218604":http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry) | ||
** Removing const modifiers ('''T7''') ( | ** Removing const modifiers ('''T7''') ("bug 219234":http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry, "bug 209041":http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry) | ||
** Non-greedy quantifiers ('''T3''') ( | ** Non-greedy quantifiers ('''T3''') ("bug 116127":http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry) | ||
=== Low Level issues === | === Low Level issues === | ||
Line 28: | Line 35: | ||
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching. | * '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching. | ||
* '''T4''': lookbehind is not supported (lookahead is) | * '''T4''': lookbehind is not supported (lookahead is) | ||
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn( | * '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".'''" returns 3, not 0 | ||
''' '''T6''': only linear input is supported, for a text editor like Kate this does not scale | |||
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object. | * '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object. | ||
Line 40: | Line 48: | ||
== Proposed libraries == | == Proposed libraries == | ||
* | * "PCRE":http://www.pcre.org/ | ||
* | * "V8" | ||
* | * "ICU":http://userguide.icu-project.org/strings/regexp | ||
* | * "Boost.Regex":http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html | ||
* std::regex (new in C+''11) | * std::regex (new in C+''11) | ||
* "RE2":http://code.google.com/p/re2/ | |||
h2. Feature matrix | |||
|''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2| | |||
|''. General comments|See above.| | | | | | | | |||
|''. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No | | |||
|''. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |very fast, use a DFA| | |||
|''. Cons | |Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |uses UTF-8, doesn't have the lookbehind neither lookahead | | |||
|''. Fixes T1 | |✔| | ✔| | |✔ | | |||
|''. Fixes T2 | |✔| | ✔| | |✔ | | |||
|''. Fixes T3 | |✔| |✔| | |? | | |||
|''. Fixes T4 | |✔| |✘| | | ✘| | |||
|''. Fixes T5 | |?| |?| | |✘ | | |||
|''. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see "UText":http://userguide.icu-project.org/strings/utext .| | |✔ see StringPiece | | |||
|''. Fixes T7 | |✔| |✔| | | ✘| | |||
✘✔ | |||
< | |||
h3. Supported syntax: Characters | |||
|''. |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2| | |||
|''. |BELL|✔|✔| |✔| | | | | |||
|''. |beginning of input| |✔| |✔| | | | | |||
|''. inside a [set]|BACKSPACE|?|✔| | | | | | | |||
|''. outside a [set]|on a word boundary|✔|✔| | | | | | | |||
|''. |not on a word boundary|✔|✔| |✔| | | | | |||
|''. |ASCII control character X|✘|✔| |✔| | | | | |||
|''. |digit|✔|✔| |✔| | | | | |||
|''. |non digit|✔|✔| |✔| | | | | |||
|''. |ESCAPE|✘|✔| |✔| | | | | |||
|''. |end of … quoting|✘|✔| |✔| | | | | |||
|''. |FORM FEED|✔|✔| |✔| | | | | |||
|''. |end of previous match|✘|✔| |✔| | | | | |||
|''. |LINE FEED|✔|✔| | | | | | | |||
|''. {x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | | | |||
|''. {x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | | | |||
|''. {x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | | | |||
|''. |start of … quoting|✘|✔| |✔| | | | | |||
|''. |CARRIAGE RETURN|✔|✔| |✔| | | | | |||
|''. |white space|✔|✔| |✔| | | | | |||
|''. |non white space|✔|✔| |✔| | | | | |||
|''. |HORIZONTAL TAB|✔|✔| |✔| | | | | |||
|''. |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | | | |||
|''. |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | | | |||
|''. VERTICAL TAB|✔|✔| |✘| | | | | |||
|''. |word character|✔|✔| |✔| | | | | |||
|''. |non word character|✔|✔| |✔| | | | | |||
|''. {hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | | | |||
|''. |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | | | |||
|''. |grapheme cluster|✘|✘| | | | | | | |||
|''. |end of input (or before the final )|✘|✔| |✔| | | | | |||
|''. |end of input|✘|✔| |✔| | | | | |||
|''. |n-th backreference|✔|✔| | | | | | | |||
|''. ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | | | |||
|''. .|any character but newlines|✔|✔| |✔| | | | | |||
|''. ^|line beginning|✔|✔| |✔| | | | | |||
|''. $|line end|✔|✔| |✔| | | | | |||
|''. quote the following symbol|✔|✔| |✔| | | | | |||
|''. [pattern]|set|✔|✔| |✔| | | | | |||
h3. Supported syntax: Operators | |||
|''. Operator |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2| | |||
|''. * |match 0 or more times|✔|✔| |✔|?|?|✔| | |||
|''. + |match 1 or more times|✔|✔| |✔| | |✔| | |||
|''. ? |match 0 or 1 times|✔|✔| |✔| | |✔| | |||
|''. {n} |match n times|✔|✔| |✔| | |✔| | |||
|''. {n,} |match n or more times|✔|✔| |✔| | |✔| | |||
|''. {n,m} |match between n and m times|✔|✔| |✔| | |✔| | |||
|''. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔| | |||
|''. ''? |match 1 or more times, not greedy|✘|✔| |✔| | |✔| | |||
|''. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔| | |||
|''. {n}? |match n times|✘|✔| |✔| | |✔| | |||
|''. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔| | |||
|''. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔| | |||
|''. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘| | |||
|''.''+ |match 1 or more times, possessive|✘|✔| |✔| | |✘| | |||
|''. ?'' |match 0 or 1 times, possessive|✘|✔| |✔| | |✘| | |||
|''. {n}+ |match n times|✘|✔| |✔| | |✘| | |||
|''. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘| | |||
|''. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘| | |||
|''. ( … ) |capturing group|✔|✔| |✔| | |✔| | |||
|''. (?: … ) |group|✔|✔| |✔| | |✔| | |||
|''. (?> … ) |atomic grouping|✘|✔| |✔| | |✘| | |||
|''. (?# … ) |comment|✘|✔| |✔| | |✘| | |||
|''. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘| | |||
|''. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘| | |||
|''. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘| | |||
|''. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘| | |||
|''. (?flags: … ) |flags change|✘|✔| | | | |✔| | |||
|''. (?flags) |flags change|✘|✔| |✔| | |✔| | |||
|''. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔| | |||
|''. (?<name> …) |named capturing group|✘|✔| |✘| | |✘| | |||
|''. (?'name' …) |named capturing group|✘|✔| |✘| | |✘| | |||
|''. (?&#x7c; …) |branch reset|✘|✔| |✘| | |✘| | |||
|''. | | | | | | | | | | |||
|''. | | | | | | | | | | |||
h3. Supported syntax: flags | |||
|''. Flag |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2| | |||
|''. /i |case insensitive|✔|✔| |✔| | |✔| | |||
|''. /m |multi-line|✘|✔| |✔| | |✔| | |||
|''. /s |dot matches anything|~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref> |✔| |✔| | |✔| | |||
|''. /x |ignore whitespace and comments|✘|✔| |✔| | |✔| | |||
|_. /U |minimal match|✔|✔| |✘| | |✔| | |||
= Benchmarks = | = Benchmarks = |
Revision as of 09:48, 25 February 2015
[toc align_right="yes" depth="3"]
Regular expression engine in Qt5
This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
- http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
- https://bugreports.qt.nokia.com/browse/QTBUG-20888
Current issues with QRegExp
From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
High level issues
- QRegExp API is broken (see T7 in Low Level)
- QRegExp is used for QtScript though it does not fullfill the ECMAScript specification "ECMA-262-1999":http://www.ecma-international.org/publications/standards/Ecma-262.htm . Missing features include
- Non-greedy quantifiers (see page 141 titled "- 129 -")
' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
- Patternist/XPath also needs Regex features not found in QRegExp, including
' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
- Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly "that problem":http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 …
- Customer complained about QRegExp (though I don't see what's their exact problem):
- In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
- Public task request:
- Lookbehind (T4) ("bug 217916":http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry)
- Support for POSIX syntax ("bug 218604":http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry)
- Removing const modifiers (T7) ("bug 219234":http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry, "bug 209041":http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry)
- Non-greedy quantifiers (T3) ("bug 116127":http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry)
Low Level issues
- T1: ^ (caret) and $ (dollar) cannot match at each newline
- T2: . (dot) always matches newlines
- T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
- T4: lookbehind is not supported (lookahead is)
- T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern "." returns 3, not 0
T6: only linear input is supported, for a text editor like Kate this does not scale
- T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
Future
- It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
- QRegExp likely to be moved into its own module in order to keep source compatibility.
- Addresses the above low-level issues (as much as possible).
- (Nice to have) At least the same syntax / features than std::regex
Proposed libraries
- "PCRE":http://www.pcre.org/
- "V8"
- "ICU":http://userguide.icu-project.org/strings/regexp
- "Boost.Regex":http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html
- std::regex (new in C+11)
- "RE2":http://code.google.com/p/re2/
h2. Feature matrix
|. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. General comments|See above.| | | | | | | |. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No | |. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |very fast, use a DFA| |. Cons | |Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |uses UTF-8, doesn't have the lookbehind neither lookahead | |. Fixes T1 | |✔| | ✔| | |✔ | |. Fixes T2 | |✔| | ✔| | |✔ | |. Fixes T3 | |✔| |✔| | |? | |. Fixes T4 | |✔| |✘| | | ✘| |. Fixes T5 | |?| |?| | |✘ | |. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see "UText":http://userguide.icu-project.org/strings/utext .| | |✔ see StringPiece | |. Fixes T7 | |✔| |✔| | | ✘|
✘✔
h3. Supported syntax: Characters
|. |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. |BELL|✔|✔| |✔| | | | |. |beginning of input| |✔| |✔| | | | |. inside a [set]|BACKSPACE|?|✔| | | | | | |. outside a [set]|on a word boundary|✔|✔| | | | | | |. |not on a word boundary|✔|✔| |✔| | | | |. |ASCII control character X|✘|✔| |✔| | | | |. |digit|✔|✔| |✔| | | | |. |non digit|✔|✔| |✔| | | | |. |ESCAPE|✘|✔| |✔| | | | |. |end of … quoting|✘|✔| |✔| | | | |. |FORM FEED|✔|✔| |✔| | | | |. |end of previous match|✘|✔| |✔| | | | |. |LINE FEED|✔|✔| | | | | | |. {x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | | |. {x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | | |. {x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | | |. |start of … quoting|✘|✔| |✔| | | | |. |CARRIAGE RETURN|✔|✔| |✔| | | | |. |white space|✔|✔| |✔| | | | |. |non white space|✔|✔| |✔| | | | |. |HORIZONTAL TAB|✔|✔| |✔| | | | |. |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | | |. |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | | |. VERTICAL TAB|✔|✔| |✘| | | | |. |word character|✔|✔| |✔| | | | |. |non word character|✔|✔| |✔| | | | |. {hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | | |. |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | | |. |grapheme cluster|✘|✘| | | | | | |. |end of input (or before the final )|✘|✔| |✔| | | | |. |end of input|✘|✔| |✔| | | | |. |n-th backreference|✔|✔| | | | | | |. ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | | |. .|any character but newlines|✔|✔| |✔| | | | |. ^|line beginning|✔|✔| |✔| | | | |. $|line end|✔|✔| |✔| | | | |. quote the following symbol|✔|✔| |✔| | | | |. [pattern]|set|✔|✔| |✔| | | |
h3. Supported syntax: Operators
|. Operator |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. * |match 0 or more times|✔|✔| |✔|?|?|✔| |. + |match 1 or more times|✔|✔| |✔| | |✔| |. ? |match 0 or 1 times|✔|✔| |✔| | |✔| |. {n} |match n times|✔|✔| |✔| | |✔| |. {n,} |match n or more times|✔|✔| |✔| | |✔| |. {n,m} |match between n and m times|✔|✔| |✔| | |✔| |. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔| |. ? |match 1 or more times, not greedy|✘|✔| |✔| | |✔| |. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔| |. {n}? |match n times|✘|✔| |✔| | |✔| |. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔| |. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔| |. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘| |.+ |match 1 or more times, possessive|✘|✔| |✔| | |✘| |. ? |match 0 or 1 times, possessive|✘|✔| |✔| | |✘| |. {n}+ |match n times|✘|✔| |✔| | |✘| |. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘| |. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘| |. ( … ) |capturing group|✔|✔| |✔| | |✔| |. (?: … ) |group|✔|✔| |✔| | |✔| |. (?> … ) |atomic grouping|✘|✔| |✔| | |✘| |. (?# … ) |comment|✘|✔| |✔| | |✘| |. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘| |. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘| |. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘| |. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘| |. (?flags: … ) |flags change|✘|✔| | | | |✔| |. (?flags) |flags change|✘|✔| |✔| | |✔| |. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔| |. (?<name> …) |named capturing group|✘|✔| |✘| | |✘| |. (?'name' …) |named capturing group|✘|✔| |✘| | |✘| |. (?| …) |branch reset|✘|✔| |✘| | |✘| |. | | | | | | | | | |. | | | | | | | | |
h3. Supported syntax: flags
|. Flag |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. /i |case insensitive|✔|✔| |✔| | |✔| |. /m |multi-line|✘|✔| |✔| | |✔| |. /s |dot matches anything|~[1] |✔| |✔| | |✔| |. /x |ignore whitespace and comments|✘|✔| |✔| | |✔| |_. /U |minimal match|✔|✔| |✘| | |✔|
Benchmarks
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
- ↑ It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.