Regexp engine in Qt5: Difference between revisions
Jump to navigation
Jump to search
m (Fixed import issue) |
m (Fixed general comments line in table Feature matrix) |
||
Line 68: | Line 68: | ||
! RE2 | ! RE2 | ||
|- | |- | ||
! General comments|See above. | ! General comments | ||
| See above. | |||
| | | | ||
| | | |
Revision as of 15:20, 22 May 2015
This article may require cleanup to meet the Qt Wiki's quality standards. Reason: Auto-imported from ExpressionEngine. Please improve this article if you can. Remove the {{cleanup}} tag and add this page to Updated pages list after it's clean. |
Regular expression engine in Qt5
This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
- http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
- https://bugreports.qt.io/browse/QTBUG-20888
Current issues with QRegExp
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html
High level issues
- QRegExp API is broken (see T7 in Low Level)
- QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
- Non-greedy quantifiers (see page 141 titled "- 129 -")
- But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
- Patternist/XPath also needs Regex features not found in QRegExp, including
- Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
- Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem …
- Customer complained about QRegExp (though I don't see what's their exact problem):
- In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
- Public task request:
- Lookbehind (T4) (bug 217916)
- Support for POSIX syntax (bug 218604)
- Removing const modifiers (T7) (bug 219234, bug 209041)
- Non-greedy quantifiers (T3) (bug 116127)
Low Level issues
- T1: ^ (caret) and $ (dollar) cannot match at each newline
- T2: . (dot) always matches newlines
- T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
- T4: lookbehind is not supported (lookahead is)
- T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
- T6: only linear input is supported, for a text editor like Kate this does not scale
- T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
Future
- It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
- QRegExp likely to be moved into its own module in order to keep source compatibility.
- Addresses the above low-level issues (as much as possible).
- (Nice to have) At least the same syntax / features than std::regex
Proposed libraries
- PCRE
- "V8"
- ICU
- Boost.Regex
- std::regex (new in C+11)
- RE2
Feature matrix
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|
General comments | See above. | ||||||
Already being used in Qt? | Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
Pros | Widely used, de-facto standard implementation for Perl-like regexps. | Uses UTF-16 natively. | very fast, use a DFA | ||||
Cons | Does not run on every platform supported by QtCore / QtBase[2]. | Boost does not give guarantees about ABI compatibility. | uses UTF-8, doesn't have the lookbehind neither lookahead | ||||
Fixes T1 | ✔ | ✔ | ✔ | ||||
Fixes T2 | ✔ | ✔ | ✔ | ||||
Fixes T3 | ✔ | ✔ | ? | ||||
Fixes T4 | ✔ | ✘ | ✘ | ||||
Fixes T5 | ? | ? | ✘ | ||||
Fixes T6 | ✔ ("by hand", with partial matching) | Maybe yes, see UText . | ✔ see StringPiece | ||||
Fixes T7 | ✔ | ✔ | ✘ |
✘✔
Supported syntax: Characters
QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | ||
---|---|---|---|---|---|---|---|---|
BELL | ✔ | ✔ | ✔ | |||||
beginning of input | ✔ | ✔ | ||||||
inside a [set] | BACKSPACE | ? | ✔ | |||||
outside a [set] | on a word boundary | ✔ | ✔ | |||||
not on a word boundary | ✔ | ✔ | ✔ | |||||
ASCII control character X | ✘ | ✔ | ✔ | |||||
digit | ✔ | ✔ | ✔ | |||||
non digit | ✔ | ✔ | ✔ | |||||
ESCAPE | ✘ | ✔ | ✔ | |||||
end of … quoting | ✘ | ✔ | ✔ | |||||
FORM FEED | ✔ | ✔ | ✔ | |||||
end of previous match | ✘ | ✔ | ✔ | |||||
LINE FEED | ✔ | ✔ | ||||||
{x} | UNICODE CHARACTER NAME x | ✘ | ✘ | ✔ | ||||
{x} | UNICODE PROPERTY NAME x | ✘ | ✔ | ✔ | ||||
{x} | UNICODE PROPERTY NAME not x | ✘ | ✔ | ✔ | ||||
start of … quoting | ✘ | ✔ | ✔ | |||||
CARRIAGE RETURN | ✔ | ✔ | ✔ | |||||
white space | ✔ | ✔ | ✔ | |||||
non white space | ✔ | ✔ | ✔ | |||||
HORIZONTAL TAB | ✔ | ✔ | ✔ | |||||
U+hhhh (between U+0000 and U+FFFF) | ✘ | ✘ | ||||||
U+hhhhhhhh (between U+00000000 and U+0010FFFF) | ✘ | ✘ | ||||||
VERTICAL TAB | ✔ | ✔ | ✘ | |||||
word character | ✔ | ✔ | ✔ | |||||
non word character | ✔ | ✔ | ✔ | |||||
{hhhh} | U+hhhh | ✘ | ✔ (0-10FFFF) | ✔ (0-10FFFF) | ||||
U+hhhh | ✔ (0000-FFFF) | ✔ (00-FF) | ✔ (00-FF) | |||||
grapheme cluster | ✘ | ✘ | ||||||
end of input (or before the final ) | ✘ | ✔ | ✔ | |||||
end of input | ✘ | ✔ | ✔ | |||||
n-th backreference | ✔ | ✔ | ||||||
ooo | ASCII/Latin-1 character 0ooo | ✔ | ✔ | |||||
. | any character but newlines | ✔ | ✔ | ✔ | ||||
^ | line beginning | ✔ | ✔ | ✔ | ||||
$ | line end | ✔ | ✔ | ✔ | ||||
quote the following symbol | ✔ | ✔ | ✔ | |||||
[pattern] | set | ✔ | ✔ | ✔ |
Supported syntax: Operators
Operator | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
* | match 0 or more times | ✔ | ✔ | ✔ | ? | ? | ✔ | |
+ | match 1 or more times | ✔ | ✔ | ✔ | ✔ | |||
? | match 0 or 1 times | ✔ | ✔ | ✔ | ✔ | |||
{n} | match n times | ✔ | ✔ | ✔ | ✔ | |||
{n,} | match n or more times | ✔ | ✔ | ✔ | ✔ | |||
{n,m} | match between n and m times | ✔ | ✔ | ✔ | ✔ | |||
*? | match 0 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
? | match 1 or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
?? | match 0 or 1 times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n}? | match n times | ✘ | ✔ | ✔ | ✔ | |||
{n,}? | match n or more times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
{n,m}? | match between n and m times, not greedy | ✘ | ✔ | ✔ | ✔ | |||
*+ | match 0 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
+ | match 1 or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
? | match 0 or 1 times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n}+ | match n times | ✘ | ✔ | ✔ | ✘ | |||
{n,}+ | match n or more times, possessive | ✘ | ✔ | ✔ | ✘ | |||
{n,m}+ | match between n and m times, possessive | ✘ | ✔ | ✔ | ✘ | |||
( … ) | capturing group | ✔ | ✔ | ✔ | ✔ | |||
(?: … ) | group | ✔ | ✔ | ✔ | ✔ | |||
(?> … ) | atomic grouping | ✘ | ✔ | ✔ | ✘ | |||
(?# … ) | comment | ✘ | ✔ | ✔ | ✘ | |||
(?= … ) | look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?! … ) | negative look-ahead assertion | ✔ | ✔ | ✔ | ✘ | |||
(?<= … ) | look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?<! … ) | negative look-behind assertion | ✘ | ✔ | ✔ | ✘ | |||
(?flags: … ) | flags change | ✘ | ✔ | ✔ | ||||
(?flags) | flags change | ✘ | ✔ | ✔ | ✔ | |||
(?P<name> …) | named capturing group | ✘ | ✔ | ✘ | ✔ | |||
(?<name> …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
(?'name' …) | named capturing group | ✘ | ✔ | ✘ | ✘ | |||
…) | branch reset | ✘ | ✔ | ✘ | ✘ |
Supported syntax: flags
Flag | QRegExp | PCRE | V8 | ICU | Boost.Regex | std::regex | RE2 | |
---|---|---|---|---|---|---|---|---|
/i | case insensitive | ✔ | ✔ | ✔ | ✔ | |||
/m | multi-line | ✘ | ✔ | ✔ | ✔ | |||
/s | dot matches anything | ~[1] | ✔ | ✔ | ✔ | |||
/x | ignore whitespace and comments | ✘ | ✔ | ✔ | ✔ | |||
/U | minimal match | ✔ | ✔ | ✘ | ✔ |
Benchmarks
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
- ↑ It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.