Regexp engine in Qt5: Difference between revisions

From Qt Wiki
Jump to navigation Jump to search
(Decode HTML entity numbers)
No edit summary
Line 58: Line 58:


== Feature matrix ==
== Feature matrix ==
|''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2|
{| class="wikitable"
|''. General comments|See above.| | | | | | |
!
|''. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
! QRegExp  
|''. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |very fast, use a DFA|
! PCRE
|''. Cons | |Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |uses UTF-8, doesn't have the lookbehind neither lookahead |
! V8
|''. Fixes T1 | |✔| | ✔| | |✔ |
! ICU
|''. Fixes T2 | |✔| | ✔| | |✔ |
! Boost.Regex
|''. Fixes T3 | |✔| |✔| | |? |
! std::regex
|''. Fixes T4 | |✔| |✘| | | ✘|
! RE2
|''. Fixes T5 | |?| |?| | |✘ |
|-
|''. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see [http://userguide.icu-project.org/strings/utext UText] .| | |✔ see StringPiece |
! General comments|See above.
|''. Fixes T7 | |✔| |✔| | | ✘|
|  
|  
|  
|  
|  
|  
|-
! Already being used in Qt?
| Yes  
| Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed.  
| Yes (Qt 5)  
| libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale.  
| No  
| No  
| No  
|-
! Pros  
|  
|Widely used, de-facto standard implementation for Perl-like regexps.
|  
|Uses UTF-16 natively.  
|  
|  
|very fast, use a DFA
|-
! Cons  
|  
|Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2].  
|  
|Boost does not give guarantees about ABI compatibility.
|  
|uses UTF-8, doesn't have the lookbehind neither lookahead  
|-
! Fixes T1  
|  
|✔
|  
| ✔
|  
|  
|✔  
|-
! Fixes T2  
|  
|✔
|  
| ✔
|  
|  
|✔  
|-
! Fixes T3  
|  
|✔
|  
|✔
|  
|  
|?  
|-
! Fixes T4  
|  
|✔
|  
|✘
|  
|  
| ✘
|-
! Fixes T5  
|  
|?
|  
|?
|  
|  
|✘  
|-
! Fixes T6  
|  
|✔ ("by hand", with partial matching)
|  
|Maybe yes, see [http://userguide.icu-project.org/strings/utext UText] .
|  
|  
|✔ see StringPiece  
|-
! Fixes T7  
|  
|✔
|  
|✔
|  
|  
| ✘
|}


✘✔
✘✔
Line 76: Line 171:


=== Supported syntax: Characters ===
=== Supported syntax: Characters ===
|''. |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2|
 
|''. |BELL|✔|✔| |✔| | | |
{| class="wikitable"
|''. |beginning of input| |✔| |✔| | | |
!
|''. inside a [set]|BACKSPACE|?|✔| | | | | |
!
|''. outside a [set]|on a word boundary|✔|✔| | | | | |
! QRegExp  
|''. |not on a word boundary|✔|✔| |✔| | | |
! PCRE
|''. |ASCII control character X|✘|✔| |✔| | | |
! V8
|''. |digit|✔|✔| |✔| | | |
! ICU
|''. |non digit|✔|✔| |✔| | | |
! Boost.Regex
|''. |ESCAPE|✘|✔| |✔| | | |
! std::regex
|''. |end of … quoting|✘|✔| |✔| | | |
! RE2
|''. |FORM FEED|✔|✔| |✔| | | |
|-
|''. |end of previous match|✘|✔| |✔| | | |
!
|''. |LINE FEED|✔|✔| | | | | |
|BELL
|''. {x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | |
|✔
|''. {x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | |
|✔
|''. {x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | |
|  
|''. |start of … quoting|✘|✔| |✔| | | |
|✔
|''. |CARRIAGE RETURN|✔|✔| |✔| | | |
|  
|''. |white space|✔|✔| |✔| | | |
|  
|''. |non white space|✔|✔| |✔| | | |
|  
|''. |HORIZONTAL TAB|✔|✔| |✔| | | |
|-
|''. |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | |
!
|''. |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | |
|beginning of input
|''. VERTICAL TAB|✔|✔| |✘| | | |
|  
|''. |word character|✔|✔| |✔| | | |
|✔
|''. |non word character|✔|✔| |✔| | | |
|  
|''. {hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | |
|✔
|''. |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | |
|  
|''. |grapheme cluster|✘|✘| | | | | |
|  
|''. |end of input (or before the final )|✘|✔| |✔| | | |
|  
|''. |end of input|✘|✔| |✔| | | |
|-
|''. |n-th backreference|✔|✔| | | | | |
! inside a [set]
|''. ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | |
|BACKSPACE
|''. .|any character but newlines|✔|✔| |✔| | | |
|?
|''. ^|line beginning|✔|✔| |✔| | | |
|✔
|''. $|line end|✔|✔| |✔| | | |
|  
|''. quote the following symbol|✔|✔| |✔| | | |
|  
|''. [pattern]|set|✔|✔| |✔| | | |
|  
|  
|  
|-
! outside a [set]
|on a word boundary
|✔
|✔
|  
|  
|  
|  
|  
|-
!
|not on a word boundary
|✔
|✔
|  
|✔
|  
|
|  
|-
!
|ASCII control character X
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|digit
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|non digit
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|ESCAPE
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|end of … quoting
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|FORM FEED
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|end of previous match
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|LINE FEED
|✔
|✔
|  
|  
|  
|  
|  
|-
! {x}
|UNICODE CHARACTER NAME x
|✘
|✘
|  
|✔
|  
|  
|  
|-
! {x}
|UNICODE PROPERTY NAME x
|✘
|✔
|  
|✔
|  
|  
|  
|-
! {x}
|UNICODE PROPERTY NAME not x
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|start of … quoting
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|CARRIAGE RETURN
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|white space
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|non white space
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|HORIZONTAL TAB
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|U+hhhh (between U+0000 and U+FFFF)
|✘
|✘
|  
|  
|  
|  
|  
|-
!
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|✘
|✘
|  
|  
|  
|  
|  
|-
! VERTICAL TAB
|✔
|✔
|  
|✘
|  
|  
|  
|-
!
|word character
|✔
|✔
|  
|✔
|  
|  
|  
|-
!
|non word character
|✔
|✔
|  
|✔
|  
|  
|  
|-
! {hhhh}
|U+hhhh
|✘
|✔ (0-10FFFF)
|  
|✔ (0-10FFFF)
|  
|  
|  
|-
!
|U+hhhh
|✔ (0000-FFFF)
|✔ (00-FF)
|  
|✔ (00-FF)
|  
|  
|  
|-
!
|grapheme cluster
|✘
|✘
|  
|  
|  
|  
|  
|-
!
|end of input (or before the final )
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|end of input
|✘
|✔
|  
|✔
|  
|  
|  
|-
!
|n-th backreference
|✔
|✔
|  
|  
|  
|  
|  
|-
! ooo
|ASCII/Latin-1 character 0ooo
|✔
|✔
|  
|  
|  
|  
|  
|-
! .
|any character but newlines
|✔
|✔
|  
|✔
|  
|  
|  
|-
! ^
|line beginning
|✔
|✔
|  
|✔
|  
|  
|  
|-
! $
|line end
|✔
|✔
|  
|✔
|  
|  
|  
|-
! quote the following symbol
|✔
|✔
|  
|✔
|  
|  
|  
|-
! [pattern]
|set
|✔
|✔
|  
|✔
|  
|  
|  
|}





Revision as of 08:02, 14 April 2015

This article may require cleanup to meet the Qt Wiki's quality standards. Reason: Auto-imported from ExpressionEngine.
Please improve this article if you can. Remove the {{cleanup}} tag and add this page to Updated pages list after it's clean.

[toc align_right="yes" depth="3"]

Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")

' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.

  • Patternist/XPath also needs Regex features not found in QRegExp, including

' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )

  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem
  • Customer complained about QRegExp (though I don't see what's their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern "." returns 3, not 0

T6: only linear input is supported, for a text editor like Kate this does not scale

  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn't have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7

✘✔


Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
BELL
beginning of input
inside a [set] BACKSPACE ?
outside a [set] on a word boundary
not on a word boundary
ASCII control character X
digit
non digit
ESCAPE
end of … quoting
FORM FEED
end of previous match
LINE FEED
{x} UNICODE CHARACTER NAME x
{x} UNICODE PROPERTY NAME x
{x} UNICODE PROPERTY NAME not x
start of … quoting
CARRIAGE RETURN
white space
non white space
HORIZONTAL TAB
U+hhhh (between U+0000 and U+FFFF)
U+hhhhhhhh (between U+00000000 and U+0010FFFF)
VERTICAL TAB
word character
non word character
{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
grapheme cluster
end of input (or before the final )
end of input
n-th backreference
ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
quote the following symbol
[pattern] set


Supported syntax: Operators

|. Operator |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. * |match 0 or more times|✔|✔| |✔|?|?|✔| |. + |match 1 or more times|✔|✔| |✔| | |✔| |. ? |match 0 or 1 times|✔|✔| |✔| | |✔| |. {n} |match n times|✔|✔| |✔| | |✔| |. {n,} |match n or more times|✔|✔| |✔| | |✔| |. {n,m} |match between n and m times|✔|✔| |✔| | |✔| |. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔| |. ? |match 1 or more times, not greedy|✘|✔| |✔| | |✔| |. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔| |. {n}? |match n times|✘|✔| |✔| | |✔| |. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔| |. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔| |. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘| |.+ |match 1 or more times, possessive|✘|✔| |✔| | |✘| |. ? |match 0 or 1 times, possessive|✘|✔| |✔| | |✘| |. {n}+ |match n times|✘|✔| |✔| | |✘| |. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘| |. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘| |. ( … ) |capturing group|✔|✔| |✔| | |✔| |. (?: … ) |group|✔|✔| |✔| | |✔| |. (?> … ) |atomic grouping|✘|✔| |✔| | |✘| |. (?# … ) |comment|✘|✔| |✔| | |✘| |. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘| |. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘| |. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘| |. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘| |. (?flags: … ) |flags change|✘|✔| | | | |✔| |. (?flags) |flags change|✘|✔| |✔| | |✔| |. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔| |. (?<name> …) |named capturing group|✘|✔| |✘| | |✘| |. (?'name' …) |named capturing group|✘|✔| |✘| | |✘| |. (?| …) |branch reset|✘|✔| |✘| | |✘| |. | | | | | | | | | |. | | | | | | | | |


Supported syntax: flags

|. Flag |. |. QRegExp |. PCRE|. V8|. ICU|. Boost.Regex|. std::regex|. RE2| |. /i |case insensitive|✔|✔| |✔| | |✔| |. /m |multi-line|✘|✔| |✔| | |✔| |. /s |dot matches anything|~[1] |✔| |✔| | |✔| |. /x |ignore whitespace and comments|✘|✔| |✔| | |✔| |_. /U |minimal match|✔|✔| |✘| | |✔|

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.