Regexp engine in Qt5
Jump to navigation
Jump to search
Regular expression engine in Qt5
This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
- http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
- https://bugreports.qt.io/browse/QTBUG-20888
Current issues with QRegExp
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html
High level issues
- QRegExp API is broken (see T7 in Low Level)
- QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
- Non-greedy quantifiers (see page 141 titled "- 129 -")
- But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
- Patternist/XPath also needs Regex features not found in QRegExp, including
- Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
- Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem …
- Customer complained about QRegExp (though I don't see what's their exact problem):
- In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
- Public task request:
- Lookbehind (T4) (bug 217916)
- Support for POSIX syntax (bug 218604)
- Removing const modifiers (T7) (bug 219234, bug 209041)
- Non-greedy quantifiers (T3) (bug 116127)
Low Level issues
- T1: ^ (caret) and $ (dollar) cannot match at each newline
- T2: . (dot) always matches newlines
- T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
- T4: lookbehind is not supported (lookahead is)
- T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
- T6: only linear input is supported, for a text editor like Kate this does not scale
- T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
Future
- It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
- QRegExp likely to be moved into its own module in order to keep source compatibility.
- Addresses the above low-level issues (as much as possible).
- (Nice to have) At least the same syntax / features than std::regex
Proposed libraries
- PCRE
- "V8"
- ICU
- Boost.Regex
- std::regex (new in C+11)
- RE2
Feature matrix
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|
General comments | See above. | ||||||
Already being used in Qt? | Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
Pros | Widely used, de-facto standard implementation for Perl-like regexps. | Uses UTF-16 natively. | very fast, use a DFA | ||||
Cons | Does not run on every platform supported by QtCore / QtBase[2]. | Boost does not give guarantees about ABI compatibility. | uses UTF-8, doesn't have the lookbehind neither lookahead | ||||
Fixes T1 | ✔ | ✔ | ✔ | ||||
Fixes T2 | ✔ | ✔ | ✔ | ||||
Fixes T3 | ✔ | ✔ | ? | ||||
Fixes T4 | ✔ | ✘ | ✘ | ||||
Fixes T5 | ? | ? | ✘ | ||||
Fixes T6 | ✔ ("by hand", with partial matching) | Maybe yes, see UText . | ✔ see StringPiece | ||||
Fixes T7 | ✔ | ✔ | ✘ |
✘✔
Supported syntax: Characters
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | ||
---|---|---|---|---|---|---|---|---|
\a | BELL | ✔ | ✔ | ✔ | ||||
\A | beginning of input | ✔ | ✔ | |||||
\b inside a [set] | BACKSPACE | ? | ✔ | |||||
\b outside a [set] | on a word boundary | ✔ | ✔ | |||||
\B | not on a word boundary | ✔ | ✔ | ✔ | ||||
\cX | ASCII control character X | ✘ | ✔ | ✔ | ||||
\d | digit | ✔ | ✔ | ✔ | ||||
\D | non digit | ✔ | ✔ | ✔ | ||||
\e | ESCAPE | ✘ | ✔ | ✔ | ||||
\E | end of \Q … \E quoting | ✘ | ✔ | ✔ | ||||
\f | FORM FEED | ✔ | ✔ | ✔ | ||||
\G | end of previous match | ✘ | ✔ | ✔ | ||||
\n | LINE FEED | ✔ | ✔ | |||||
\N{x} | UNICODE CHARACTER NAME x | ✘ | ✘ | ✔ | ||||
\p{x} | UNICODE PROPERTY NAME x | ✘ | ✔ | ✔ | ||||
\P{x} | UNICODE PROPERTY NAME not x | ✘ | ✔ | ✔ | ||||
\Q | start of \Q … \E quoting | ✘ | ✔ | ✔ | ||||
\r | CARRIAGE RETURN | ✔ | ✔ | ✔ | ||||
\s | white space | ✔ | ✔ | ✔ | ||||
\S | non white space | ✔ | ✔ | ✔ | ||||
\t | HORIZONTAL TAB | ✔ | ✔ | ✔ | ||||
\uhhhh | U+hhhh (between U+0000 and U+FFFF) | ✘ | ✘ | |||||
\Uhhhhhhhh | U+hhhhhhhh (between U+00000000 and U+0010FFFF) | ✘ | ✘ | |||||
\v | VERTICAL TAB | ✔ | ✔ | ✘ | ||||
\w | word character | ✔ | ✔ | ✔ | ||||
\W | non word character | ✔ | ✔ | ✔ | ||||
\x{hhhh} | U+hhhh | ✘ | ✔ (0-10FFFF) | ✔ (0-10FFFF) | ||||
\xhhhh | U+hhhh | ✔ (0000-FFFF) | ✔ (00-FF) | ✔ (00-FF) | ||||
\X | grapheme cluster | ✘ | ✘ | |||||
\Z | end of input (or before the final ) | ✘ | ✔ | ✔ | ||||
\z | end of input | ✘ | ✔ | ✔ | ||||
\n | n-th backreference | ✔ | ✔ | |||||
\0ooo | ASCII/Latin-1 character 0ooo | ✔ | ✔ | |||||
. | any character but newlines | ✔ | ✔ | ✔ | ||||
^ | line beginning | ✔ | ✔ | ✔ | ||||
$ | line end | ✔ | ✔ | ✔ | ||||
\ | quote the following symbol | ✔ | ✔ | ✔ | ||||
[pattern] | set | ✔ | ✔ | ✔ |
Supported syntax: Operators
Operator | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
* | match 0 or more times | ✔ | ✔ | ✔ | ? | ? | ✔ | |
+ | match 1 or more times | ✔ | ✔ | ✔ | ✔ | |||
? | match 0 or 1 times | ✔ | ✔ | ✔ | ✔ | |||
{n} | match n times | ✔ | ✔ | ✔ | ✔ | |||
{n,} | match n or more times | ✔ | ✔ | ✔ | ✔ | |||
{n,m} | match between n and m times | ✔ | ✔ | ✔ | ✔ | |||
*? | match 0 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
? | match 1 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
?? | match 0 or 1 times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n}? | match n times | ✘ | ✔ | ✔ | ✔ | |||
{n,}? | match n or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n,m}? | match between n and m times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
*+ | match 0 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
+ | match 1 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
? | match 0 or 1 times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n}+ | match n times | ✘ | ✔ | ✔ | ✘ | |||
{n,}+ | match n or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n,m}+ | match between n and m times, possessive | ✘ | ✔ | ✔ | ✘ | |||
( … ) | capturing group | ✔ | ✔ | ✔ | ✔ | |||
(?: … ) | group | ✔ | ✔ | ✔ | ✔ | |||
(?> … ) | atomic grouping | ✘ | ✔ | ✔ | ✘ | |||
(?# … ) | comment | ✘ | ✔ | ✔ | ✘ | |||
(?= … ) | look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?! … ) | negative look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?<= … ) | look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?<! … ) | negative look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?flags: … ) | flags change | ✘ | ✔ | ✔ | ||||
(?flags) | flags change | ✘ | ✔ | ✔ | ✔ | |||
(?P<name> …) | named capturing group | ✘ | ✔ | ✘ | ✔ | |||
(?<name> …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
(?'name' …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
…) | branch reset | ✘ | ✔ | ✘ | ✘ |
Supported syntax: flags
Flag | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
/i | case insensitive | ✔ | ✔ | ✔ | ✔ | |||
/m | multi-line | ✘ | ✔ | ✔ | ✔ | |||
/s | dot matches anything | ~[1] | ✔ | ✔ | ✔ | |||
/x | ignore whitespace and comments | ✘ | ✔ | ✔ | ✔ | |||
/U | minimal match | ✔ | ✔ | ✘ | ✔ |
Benchmarks
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
- ↑ It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.