Regexp engine in Qt5: Difference between revisions
Jump to navigation
Jump to search
No edit summary |
(Removed the cleanup tag as the page is now correct) |
||
(14 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]] | |||
= Regular expression engine in Qt5 = | |||
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf. | |||
* http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html | |||
* https://bugreports.qt.io/browse/QTBUG-20888 | |||
== Current issues with QRegExp == | |||
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html | |||
* QRegExp | === High level issues === | ||
* QRegExp is used for QtScript though it does not fullfill the | |||
** Non-greedy quantifiers (see page 141 titled | * QRegExp API is broken (see '''T7''' in Low Level) | ||
** But current implementation of QtScript uses | * QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include | ||
** Non-greedy quantifiers (see page 141 titled "- 129 -") | |||
** But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container. | |||
* Patternist/XPath also needs Regex features not found in QRegExp, including | * Patternist/XPath also needs Regex features not found in QRegExp, including | ||
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax ) | ** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax ) | ||
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] | * Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] … | ||
* Customer complained about QRegExp (though I | * Customer complained about QRegExp (though I don't see what's their exact problem): | ||
** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the | ** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.'' | ||
* Public task request: | * Public task request: | ||
** Lookbehind ('''T4''') ([http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry bug 217916] | ** Lookbehind ('''T4''') ([http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry bug 217916]) | ||
** Support for | ** Support for POSIX syntax ([http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry bug 218604]) | ||
** Removing const modifiers ('''T7''') ([http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry bug 219234] | ** Removing const modifiers ('''T7''') ([http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry bug 219234], [http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry bug 209041]) | ||
** Non-greedy quantifiers ('''T3''') ([http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry bug 116127] | ** Non-greedy quantifiers ('''T3''') ([http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry bug 116127]) | ||
===Low Level issues=== | === Low Level issues === | ||
* '''T1''': ^ (caret) and $ (dollar) cannot match at each newline | * '''T1''': ^ (caret) and $ (dollar) cannot match at each newline | ||
Line 33: | Line 35: | ||
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching. | * '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching. | ||
* '''T4''': lookbehind is not supported (lookahead is) | * '''T4''': lookbehind is not supported (lookahead is) | ||
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn( | * '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0 | ||
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale | * '''T6''': only linear input is supported, for a text editor like Kate this does not scale | ||
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object. | * '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object. | ||
=Future= | = Future = | ||
* It '''must''' be a solid 3rd party engine — | * It '''must''' be a solid 3rd party engine — don't want to develop an in-house engine and maintain it. | ||
* QRegExp likely to be moved into its own module in order to keep source compatibility. | * QRegExp likely to be moved into its own module in order to keep source compatibility. | ||
* Addresses the above low-level issues (as much as possible). | * Addresses the above low-level issues (as much as possible). | ||
* (Nice to have) At least the same syntax / features than std::regex | * (Nice to have) At least the same syntax / features than std::regex | ||
==Proposed libraries== | == Proposed libraries == | ||
* [http://www.pcre.org/ | * [http://www.pcre.org/ PCRE] | ||
* | * "V8" | ||
* [http://userguide.icu-project.org/strings/regexp | * [http://userguide.icu-project.org/strings/regexp ICU] | ||
* [http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html Boost.Regex] | * [http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html Boost.Regex] | ||
* std::regex (new in C+ | * std::regex (new in C+''11) | ||
* [http://code.google.com/p/re2/ RE2] | * [http://code.google.com/p/re2/ RE2] | ||
==Feature matrix== | == Feature matrix == | ||
{| class="wikitable" | |||
! | |||
! QRegExp | |||
! PCRE | |||
! V8 | |||
! ICU | |||
! Boost.Regex | |||
! std::regex | |||
! RE2 | |||
|- | |||
! General comments | |||
| See above. | |||
| | |||
| | |||
| | |||
| | |||
| | |||
| | |||
|- | |||
! Already being used in Qt? | |||
| Yes | |||
| Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | |||
| Yes (Qt 5) | |||
| libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | |||
| No | |||
| No | |||
| No | |||
|- | |||
! Pros | |||
| | |||
|Widely used, de-facto standard implementation for Perl-like regexps. | |||
| | |||
|Uses UTF-16 natively. | |||
| | |||
| | |||
|very fast, use a DFA | |||
|- | |||
! Cons | |||
| | |||
|Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |||
| | |||
|Boost does not give guarantees about ABI compatibility. | |||
| | |||
|uses UTF-8, doesn't have the lookbehind neither lookahead | |||
|- | |||
! Fixes T1 | |||
| | |||
|✔ | |||
| | |||
| ✔ | |||
| | |||
| | |||
|✔ | |||
|- | |||
! Fixes T2 | |||
| | |||
|✔ | |||
| | |||
| ✔ | |||
| | |||
| | |||
|✔ | |||
|- | |||
! Fixes T3 | |||
| | |||
|✔ | |||
| | |||
|✔ | |||
| | |||
| | |||
|? | |||
|- | |||
! Fixes T4 | |||
| | |||
|✔ | |||
| | |||
|✘ | |||
| | |||
| | |||
| ✘ | |||
|- | |||
! Fixes T5 | |||
| | |||
|? | |||
| | |||
|? | |||
| | |||
| | |||
|✘ | |||
|- | |||
! Fixes T6 | |||
| | |||
|✔ ("by hand", with partial matching) | |||
| | |||
|Maybe yes, see [http://userguide.icu-project.org/strings/utext UText] . | |||
| | |||
| | |||
|✔ see StringPiece | |||
|- | |||
! Fixes T7 | |||
| | |||
|✔ | |||
| | |||
|✔ | |||
| | |||
| | |||
| ✘ | |||
|} | |||
✘✔ | ✘✔ | ||
{| class=" | |||
=== Supported syntax: Characters === | |||
{| class="wikitable" | |||
! | ! | ||
! QRegExp | ! | ||
! | ! QRegExp | ||
! PCRE | |||
! V8 | ! V8 | ||
! | ! ICU | ||
! Boost.Regex | ! Boost.Regex | ||
! std::regex | ! std::regex | ||
Line 71: | Line 183: | ||
|- | |- | ||
! \a | ! \a | ||
| | |BELL | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \A | ! \A | ||
| beginning of input | |beginning of input | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \b inside a [set] | ! \b inside a [set] | ||
| | |BACKSPACE | ||
| ? | |? | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \b outside a [set] | ! \b outside a [set] | ||
| on a word boundary | |on a word boundary | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \B | ! \B | ||
| not on a word boundary | |not on a word boundary | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | |||
| | |||
| | | | ||
| | |||
|- | |- | ||
! \cX | ! \cX | ||
| | |ASCII control character X | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \d | ! \d | ||
| digit | |digit | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \D | ! \D | ||
| non digit | |non digit | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \e | ! \e | ||
| | |ESCAPE | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \E | ! \E | ||
| end of \Q … \E quoting | |end of \Q … \E quoting | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \f | ! \f | ||
| | |FORM FEED | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \G | ! \G | ||
| end of previous match | |end of previous match | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \n | ! \n | ||
| | |LINE FEED | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \N{x} | ! \N{x} | ||
| | |UNICODE CHARACTER NAME x | ||
| ✘ | |✘ | ||
| ✘ | |✘ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \p{x} | ! \p{x} | ||
| | |UNICODE PROPERTY NAME x | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \P{x} | ! \P{x} | ||
| | |UNICODE PROPERTY NAME not x | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \Q | ! \Q | ||
| start of \Q … \E quoting | |start of \Q … \E quoting | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \r | ! \r | ||
| | |CARRIAGE RETURN | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \s | ! \s | ||
| white space | |white space | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \S | ! \S | ||
| non white space | |non white space | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \t | ! \t | ||
| | |HORIZONTAL TAB | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \uhhhh | ! \uhhhh | ||
| U+hhhh (between U+0000 and U+FFFF) | |U+hhhh (between U+0000 and U+FFFF) | ||
| ✘ | |✘ | ||
| ✘ | |✘ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \Uhhhhhhhh | ! \Uhhhhhhhh | ||
| U+hhhhhhhh (between U+00000000 and U+0010FFFF) | |U+hhhhhhhh (between U+00000000 and U+0010FFFF) | ||
| ✘ | |✘ | ||
| ✘ | |✘ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \v | ! \v | ||
| | | VERTICAL TAB | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✘ | |✘ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \w | ! \w | ||
| word character | |word character | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \W | ! \W | ||
| non word character | |non word character | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \x{hhhh} | ! \x{hhhh} | ||
| U+hhhh | |U+hhhh | ||
| ✘ | |✘ | ||
| ✔ (0-10FFFF) | |✔ (0-10FFFF) | ||
| | | | ||
| ✔ (0-10FFFF) | |✔ (0-10FFFF) | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \xhhhh | ! \xhhhh | ||
| U+hhhh | |U+hhhh | ||
| ✔ (0000- | |✔ (0000-FFFF) | ||
| ✔ (00-FF) | |✔ (00-FF) | ||
| | | | ||
| ✔ (00-FF) | |✔ (00-FF) | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \X | ! \X | ||
| grapheme cluster | |grapheme cluster | ||
| ✘ | |✘ | ||
| ✘ | |✘ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \Z | ! \Z | ||
| end of input (or before the final | |end of input (or before the final ) | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \z | ! \z | ||
| end of input | |end of input | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \n | ! \n | ||
| n-th backreference | |n-th backreference | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \0ooo | ! \0ooo | ||
| | |ASCII/Latin-1 character 0ooo | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! . | ! . | ||
| any character but newlines | |any character but newlines | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! ^ | ! ^ | ||
| line beginning | |line beginning | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! $ | ! $ | ||
| line end | |line end | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! \ | ! \ | ||
| quote the following symbol | |quote the following symbol | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|- | |- | ||
! [pattern] | ! [pattern] | ||
| set | |set | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
|} | |} | ||
===Supported syntax: Operators=== | === Supported syntax: Operators === | ||
{| class=" | {| class="wikitable" | ||
! Operator | ! Operator | ||
! | ! | ||
! QRegExp | ! QRegExp | ||
! | ! PCRE | ||
! V8 | ! V8 | ||
! | ! ICU | ||
! Boost.Regex | ! Boost.Regex | ||
! std::regex | ! std::regex | ||
! RE2 | ! RE2 | ||
|- | |- | ||
! * | ! * | ||
| match 0 or more times | |match 0 or more times | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| ? | |? | ||
| ? | |? | ||
| ✔ | |✔ | ||
|- | |||
! + | |||
|match 1 or more times | |||
|✔ | |||
|✔ | |||
| | |||
|✔ | |||
| | |||
| | |||
|✔ | |||
|- | |||
! ? | |||
|match 0 or 1 times | |||
|✔ | |||
|✔ | |||
| | |||
|✔ | |||
| | |||
| | |||
|✔ | |||
|- | |||
! {n} | |||
|match n times | |||
|✔ | |||
|✔ | |||
| | |||
|✔ | |||
| | |||
| | |||
|✔ | |||
|- | |- | ||
! | ! {n,} | ||
| match | |match n or more times | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! {n,m} | ||
| match | |match between n and m times | ||
| ✔ | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! *? | ||
| match | |match 0 or more times, not greedy | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! ''? | ||
| match | |match 1 or more times, not greedy | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! ?? | ||
| match | |match 0 or 1 times, not greedy | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! {n}? | ||
| match | |match n times | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! {n,}? | ||
| match | |match n or more times, not greedy | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! {n,m}? | ||
| match | |match between n and m times, not greedy | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✔ | |✔ | ||
|- | |- | ||
! | ! *+ | ||
| match | |match 0 or more times, possessive | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! | !''+ | ||
| match | |match 1 or more times, possessive | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! | ! ?'' | ||
| match | |match 0 or 1 times, possessive | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! | ! {n}+ | ||
| match | |match n times | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! | ! {n,}+ | ||
| match | |match n or more times, possessive | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! | ! {n,m}+ | ||
| match | |match between n and m times, possessive | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! | ! ( … ) | ||
| | |capturing group | ||
| | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! (?: … ) | ||
| | |group | ||
| | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! (?> … ) | ||
| | |atomic grouping | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! ( … ) | ! (?# … ) | ||
| | |comment | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! (? | ! (?= … ) | ||
| | |look-ahead assertion | ||
| ✔ | |✔ | ||
|✔ | |||
| ✔ | |||
| | | | ||
| | |✔ | ||
| | | | ||
| | |||
|✘ | |||
|- | |- | ||
! (? | ! (?! … ) | ||
| | |negative look-ahead assertion | ||
| | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! (? | ! (?<= … ) | ||
| | |look-behind assertion | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! (? | ! (?<! … ) | ||
| look- | |negative look-behind assertion | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| ✘ | |✘ | ||
|- | |- | ||
! (? | ! (?flags: … ) | ||
| | |flags change | ||
| | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! (? | ! (?flags) | ||
| | |flags change | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✔ | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! (? | ! (?P<name> …) | ||
| | |named capturing group | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| | |✘ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! (? | ! (?<name> …) | ||
| | |named capturing group | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| | |✘ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! (? | ! (?'name' …) | ||
| | |named capturing group | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| | |✘ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|- | |- | ||
! (? | ! (?| …) | ||
| | |branch reset | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| ✘ | |✘ | ||
| | | | ||
| | | | ||
| | |✘ | ||
|} | |||
=== Supported syntax: flags === | |||
{| class="wikitable" | |||
! Flag | |||
! | |||
! QRegExp | |||
! PCRE | |||
! V8 | |||
! ICU | |||
! Boost.Regex | |||
! std::regex | |||
! RE2 | |||
|- | |- | ||
! | ! /i | ||
| | |case insensitive | ||
| | |✔ | ||
| ✔ | |✔ | ||
| | | | ||
| | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! /m | ||
| | |multi-line | ||
| ✘ | |✘ | ||
| ✔ | |✔ | ||
| | | | ||
| | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! /s | ||
| | |dot matches anything | ||
| | |~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref> | ||
| ✔ | |✔ | ||
| | | | ||
| | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! /x | ||
| | |ignore whitespace and comments | ||
| | |✘ | ||
| | |✔ | ||
| | | | ||
| | |✔ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|- | |- | ||
! | ! /U | ||
| | |minimal match | ||
| | |✔ | ||
| | |✔ | ||
| | | | ||
| | |✘ | ||
| | | | ||
| | | | ||
| | |✔ | ||
|} | |} | ||
= Benchmarks = | |||
=Benchmarks= | |||
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results. | See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results. | ||
< | <references /> | ||
Latest revision as of 15:41, 22 May 2015
Regular expression engine in Qt5
This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
- http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
- https://bugreports.qt.io/browse/QTBUG-20888
Current issues with QRegExp
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html
High level issues
- QRegExp API is broken (see T7 in Low Level)
- QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
- Non-greedy quantifiers (see page 141 titled "- 129 -")
- But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
- Patternist/XPath also needs Regex features not found in QRegExp, including
- Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
- Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem …
- Customer complained about QRegExp (though I don't see what's their exact problem):
- In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
- Public task request:
- Lookbehind (T4) (bug 217916)
- Support for POSIX syntax (bug 218604)
- Removing const modifiers (T7) (bug 219234, bug 209041)
- Non-greedy quantifiers (T3) (bug 116127)
Low Level issues
- T1: ^ (caret) and $ (dollar) cannot match at each newline
- T2: . (dot) always matches newlines
- T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
- T4: lookbehind is not supported (lookahead is)
- T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
- T6: only linear input is supported, for a text editor like Kate this does not scale
- T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
Future
- It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
- QRegExp likely to be moved into its own module in order to keep source compatibility.
- Addresses the above low-level issues (as much as possible).
- (Nice to have) At least the same syntax / features than std::regex
Proposed libraries
- PCRE
- "V8"
- ICU
- Boost.Regex
- std::regex (new in C+11)
- RE2
Feature matrix
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|
General comments | See above. | ||||||
Already being used in Qt? | Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
Pros | Widely used, de-facto standard implementation for Perl-like regexps. | Uses UTF-16 natively. | very fast, use a DFA | ||||
Cons | Does not run on every platform supported by QtCore / QtBase[2]. | Boost does not give guarantees about ABI compatibility. | uses UTF-8, doesn't have the lookbehind neither lookahead | ||||
Fixes T1 | ✔ | ✔ | ✔ | ||||
Fixes T2 | ✔ | ✔ | ✔ | ||||
Fixes T3 | ✔ | ✔ | ? | ||||
Fixes T4 | ✔ | ✘ | ✘ | ||||
Fixes T5 | ? | ? | ✘ | ||||
Fixes T6 | ✔ ("by hand", with partial matching) | Maybe yes, see UText . | ✔ see StringPiece | ||||
Fixes T7 | ✔ | ✔ | ✘ |
✘✔
Supported syntax: Characters
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | ||
---|---|---|---|---|---|---|---|---|
\a | BELL | ✔ | ✔ | ✔ | ||||
\A | beginning of input | ✔ | ✔ | |||||
\b inside a [set] | BACKSPACE | ? | ✔ | |||||
\b outside a [set] | on a word boundary | ✔ | ✔ | |||||
\B | not on a word boundary | ✔ | ✔ | ✔ | ||||
\cX | ASCII control character X | ✘ | ✔ | ✔ | ||||
\d | digit | ✔ | ✔ | ✔ | ||||
\D | non digit | ✔ | ✔ | ✔ | ||||
\e | ESCAPE | ✘ | ✔ | ✔ | ||||
\E | end of \Q … \E quoting | ✘ | ✔ | ✔ | ||||
\f | FORM FEED | ✔ | ✔ | ✔ | ||||
\G | end of previous match | ✘ | ✔ | ✔ | ||||
\n | LINE FEED | ✔ | ✔ | |||||
\N{x} | UNICODE CHARACTER NAME x | ✘ | ✘ | ✔ | ||||
\p{x} | UNICODE PROPERTY NAME x | ✘ | ✔ | ✔ | ||||
\P{x} | UNICODE PROPERTY NAME not x | ✘ | ✔ | ✔ | ||||
\Q | start of \Q … \E quoting | ✘ | ✔ | ✔ | ||||
\r | CARRIAGE RETURN | ✔ | ✔ | ✔ | ||||
\s | white space | ✔ | ✔ | ✔ | ||||
\S | non white space | ✔ | ✔ | ✔ | ||||
\t | HORIZONTAL TAB | ✔ | ✔ | ✔ | ||||
\uhhhh | U+hhhh (between U+0000 and U+FFFF) | ✘ | ✘ | |||||
\Uhhhhhhhh | U+hhhhhhhh (between U+00000000 and U+0010FFFF) | ✘ | ✘ | |||||
\v | VERTICAL TAB | ✔ | ✔ | ✘ | ||||
\w | word character | ✔ | ✔ | ✔ | ||||
\W | non word character | ✔ | ✔ | ✔ | ||||
\x{hhhh} | U+hhhh | ✘ | ✔ (0-10FFFF) | ✔ (0-10FFFF) | ||||
\xhhhh | U+hhhh | ✔ (0000-FFFF) | ✔ (00-FF) | ✔ (00-FF) | ||||
\X | grapheme cluster | ✘ | ✘ | |||||
\Z | end of input (or before the final ) | ✘ | ✔ | ✔ | ||||
\z | end of input | ✘ | ✔ | ✔ | ||||
\n | n-th backreference | ✔ | ✔ | |||||
\0ooo | ASCII/Latin-1 character 0ooo | ✔ | ✔ | |||||
. | any character but newlines | ✔ | ✔ | ✔ | ||||
^ | line beginning | ✔ | ✔ | ✔ | ||||
$ | line end | ✔ | ✔ | ✔ | ||||
\ | quote the following symbol | ✔ | ✔ | ✔ | ||||
[pattern] | set | ✔ | ✔ | ✔ |
Supported syntax: Operators
Operator | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
* | match 0 or more times | ✔ | ✔ | ✔ | ? | ? | ✔ | |
+ | match 1 or more times | ✔ | ✔ | ✔ | ✔ | |||
? | match 0 or 1 times | ✔ | ✔ | ✔ | ✔ | |||
{n} | match n times | ✔ | ✔ | ✔ | ✔ | |||
{n,} | match n or more times | ✔ | ✔ | ✔ | ✔ | |||
{n,m} | match between n and m times | ✔ | ✔ | ✔ | ✔ | |||
*? | match 0 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
? | match 1 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
?? | match 0 or 1 times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n}? | match n times | ✘ | ✔ | ✔ | ✔ | |||
{n,}? | match n or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n,m}? | match between n and m times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
*+ | match 0 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
+ | match 1 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
? | match 0 or 1 times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n}+ | match n times | ✘ | ✔ | ✔ | ✘ | |||
{n,}+ | match n or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n,m}+ | match between n and m times, possessive | ✘ | ✔ | ✔ | ✘ | |||
( … ) | capturing group | ✔ | ✔ | ✔ | ✔ | |||
(?: … ) | group | ✔ | ✔ | ✔ | ✔ | |||
(?> … ) | atomic grouping | ✘ | ✔ | ✔ | ✘ | |||
(?# … ) | comment | ✘ | ✔ | ✔ | ✘ | |||
(?= … ) | look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?! … ) | negative look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?<= … ) | look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?<! … ) | negative look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?flags: … ) | flags change | ✘ | ✔ | ✔ | ||||
(?flags) | flags change | ✘ | ✔ | ✔ | ✔ | |||
(?P<name> …) | named capturing group | ✘ | ✔ | ✘ | ✔ | |||
(?<name> …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
(?'name' …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
…) | branch reset | ✘ | ✔ | ✘ | ✘ |
Supported syntax: flags
Flag | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
/i | case insensitive | ✔ | ✔ | ✔ | ✔ | |||
/m | multi-line | ✘ | ✔ | ✔ | ✔ | |||
/s | dot matches anything | ~[1] | ✔ | ✔ | ✔ | |||
/x | ignore whitespace and comments | ✘ | ✔ | ✔ | ✔ | |||
/U | minimal match | ✔ | ✔ | ✘ | ✔ |
Benchmarks
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
- ↑ It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.