Regexp engine in Qt5: Difference between revisions

From Qt Wiki
No edit summary
 
No edit summary
Line 1: Line 1:
=Regular expression engine in Qt5=
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]
[toc align_right="yes" depth="3"]


This page summarizes the research for an alternative regular expression engine to be used in [[Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
= Regular expression engine in Qt5 =


* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.nokia.com/browse/QTBUG-20888
* https://bugreports.qt.nokia.com/browse/QTBUG-20888


==Current issues with QRegExp==
== Current issues with QRegExp ==


From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html


===High level issues===
=== High level issues ===


* QRegExp <span class="caps">API</span> is broken (see '''T7''' in Low Level)
* QRegExp API is broken (see '''T7''' in Low Level)
* QRegExp is used for QtScript though it does not fulfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm <span class="caps">ECMA</span>-262-1999] ''[ecma-international.org]''. Missing features include
* QRegExp is used for QtScript though it does not fulfill the ECMAScript specification "ECMA-262-1999":http://www.ecma-international.org/publications/standards/Ecma-262.htm . Missing features include
** Non-greedy quantifiers (see page 141, titled “- 129 -”)
** Non-greedy quantifiers (see page 141, titled "- 129 -")
** But the current implementation of QtScript uses JSC, which has its own engine anyway, and only uses QRegExp in its API as a container.
* Patternist/XPath also needs Regex features not found in QRegExp, including
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
* Qt Creator might want to offer multi-line Regex search and replace later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly "that problem":http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 …
** But the current implementation of QtScript uses <span class="caps">JSC</span>, which has its own engine anyway, and only uses QRegExp in its API as a container.
* A customer complained about QRegExp (though I don't see what exactly their problem is):
* Patternist/XPath also needs Regex features not found in QRegExp, including
** ''In their code they have a RegExp for matching emoticons. Unfortunately, they cannot use QRegExp because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.''
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
* Qt Creator might want to offer multi-line Regex search and replace later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] ''[bugzilla.gnome.org]''
* A customer complained about QRegExp (though I don’t see what exactly their problem is):
** ''In their code they have a RegExp for matching emoticons. Unfortunately, they cannot use QRegExp because of poor support for negative/positive lookahead. As a workaround they are using the <span class="caps">PCRE</span> (Perl Compatible Regular Expressions) library.''
* Public task request:
* Public task request:
** Lookbehind ('''T4''') ([http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry bug 217916] ''[trolltech.com]'')
** Lookbehind ('''T4''') ("bug 217916":http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry)
** Support for <span class="caps">POSIX</span> syntax ([http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry bug 218604] ''[trolltech.com]'')
** Support for POSIX syntax ("bug 218604":http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry)
** Removing const modifiers ('''T7''') ([http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry bug 219234] ''[trolltech.com]'', [http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry bug 209041] ''[trolltech.com]'')
** Removing const modifiers ('''T7''') ("bug 219234":http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry, "bug 209041":http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry)
** Non-greedy quantifiers ('''T3''') ([http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry bug 116127] ''[trolltech.com]'')
** Non-greedy quantifiers ('''T3''') ("bug 116127":http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry)


===Low Level issues===
=== Low Level issues ===


* '''T1''': ^ (caret) and $ (dollar) cannot match at each newline
* '''T1''': ^ (caret) and $ (dollar) cannot match at each newline
Line 33: Line 28:
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. This is not to be confused with minimal matching.
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. This is not to be confused with minimal matching.
* '''T4''': lookbehind is not supported (lookahead is)
* '''T4''': lookbehind is not supported (lookahead is)
* '''T5''': lastIndexIn does not find the last match that indexIn would have found, e.g. lastIndexIn(“abcd”) for pattern “.*” returns 3, not 0
* '''T5''': lastIndexIn does not find the last match that indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.


=Future=
= Future =


* It '''must''' be a solid 3rd party engine — don’t want to develop an in-house engine and maintain it.
* It '''must''' be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
* QRegExp likely to be moved into its own module in order to keep source compatibility.
* QRegExp likely to be moved into its own module in order to keep source compatibility.
* Addresses the above low-level issues (as much as possible).
* Addresses the above low-level issues (as much as possible).
* (Nice to have) At least the same syntax / features as std::regex
* (Nice to have) At least the same syntax / features as std::regex


==Proposed libraries==
== Proposed libraries ==


* [http://www.pcre.org/ <span class="caps">PCRE</span>] ''[pcre.org]''
* "PCRE":http://www.pcre.org/
* “V8”
* "V8"
* [http://userguide.icu-project.org/strings/regexp <span class="caps">ICU</span>] ''[userguide.icu-project.org]''
* "ICU":http://userguide.icu-project.org/strings/regexp
* [http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html Boost.Regex] ''[boost.org]''
* "Boost.Regex":http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html
* std::regex (new in C++11)
* std::regex (new in C++11)
* "RE2":http://code.google.com/p/re2/
* [http://code.google.com/p/re2/ RE2] ''[code.google.com]''
h2. Feature matrix

|_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. General comments|See above.| | | | | | |
|_. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
|_. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |very fast, use a DFA|
|_. Cons | |Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |uses UTF-8, doesn't have the lookbehind neither lookahead |
|_. Fixes T1 | |✔| | ✔| | |✔ |
|_. Fixes T2 | |✔| | ✔| | |✔ |
|_. Fixes T3 | |✔| |✔| | |? |
|_. Fixes T4 | |✔| |✘| | | ✘|
|_. Fixes T5 | |?| |?| | |✘ |
|_. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see "UText":http://userguide.icu-project.org/strings/utext .| | |✔ see StringPiece |
|_. Fixes T7 | |✔| |✔| | | ✘|

✘✔


==Feature matrix==


✘✔
h3. Supported syntax: Characters

|_. |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. \a |BELL|✔|✔| |✔| | | |
|_. \A |beginning of input| |✔| |✔| | | |
|_. \b inside a [set]|BACKSPACE|?|✔| | | | | |
|_. \b outside a [set]|on a word boundary|✔|✔| | | | | |
|_. \B |not on a word boundary|✔|✔| |✔| | | |
|_. \cX |ASCII control character X|✘|✔| |✔| | | |
|_. \d |digit|✔|✔| |✔| | | |
|_. \D |non digit|✔|✔| |✔| | | |
|_. \e |ESCAPE|✘|✔| |✔| | | |
|_. \E |end of \Q … \E quoting|✘|✔| |✔| | | |
|_. \f |FORM FEED|✔|✔| |✔| | | |
|_. \G |end of previous match|✘|✔| |✔| | | |
|_. \n |LINE FEED|✔|✔| | | | | |
|_. \N{x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | |
|_. \p{x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | |
|_. \P{x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | |
|_. \Q |start of \Q … \E quoting|✘|✔| |✔| | | |
|_. \r |CARRIAGE RETURN|✔|✔| |✔| | | |
|_. \s |white space|✔|✔| |✔| | | |
|_. \S |non white space|✔|✔| |✔| | | |
|_. \t |HORIZONTAL TAB|✔|✔| |✔| | | |
|_. \uhhhh |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | |
|_. \Uhhhhhhhh |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | |
|_. \v |VERTICAL TAB|✔|✔| |✘| | | |
|_. \w |word character|✔|✔| |✔| | | |
|_. \W |non word character|✔|✔| |✔| | | |
|_. \x{hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | |
|_. \xhhhh |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | |
|_. \X |grapheme cluster|✘|✘| | | | | |
|_. \Z |end of input (or before the final \n)|✘|✔| |✔| | | |
|_. \z |end of input|✘|✔| |✔| | | |
|_. \n |n-th backreference|✔|✔| | | | | |
|_. \0ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | |
|_. .|any character but newlines|✔|✔| |✔| | | |
|_. ^|line beginning|✔|✔| |✔| | | |
|_. $|line end|✔|✔| |✔| | | |
|_. \ |quote the following symbol|✔|✔| |✔| | | |
|_. [pattern]|set|✔|✔| |✔| | | |


===Supported syntax: Characters===
h3. Supported syntax: Operators

|_. Operator |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. * |match 0 or more times|✔|✔| |✔|?|?|✔|
|_. + |match 1 or more times|✔|✔| |✔| | |✔|
|_. ? |match 0 or 1 times|✔|✔| |✔| | |✔|
|_. {n} |match n times|✔|✔| |✔| | |✔|
|_. {n,} |match n or more times|✔|✔| |✔| | |✔|
|_. {n,m} |match between n and m times|✔|✔| |✔| | |✔|
|_. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔|
|_. +? |match 1 or more times, not greedy|✘|✔| |✔| | |✔|
|_. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔|
|_. {n}? |match n times|✘|✔| |✔| | |✔|
|_. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔|
|_. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔|
|_. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘|
|_. ++ |match 1 or more times, possessive|✘|✔| |✔| | |✘|
|_. ?+ |match 0 or 1 times, possessive|✘|✔| |✔| | |✘|
|_. {n}+ |match n times|✘|✔| |✔| | |✘|
|_. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘|
|_. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘|
|_. ( … ) |capturing group|✔|✔| |✔| | |✔|
|_. (?: … ) |group|✔|✔| |✔| | |✔|
|_. (?> … ) |atomic grouping|✘|✔| |✔| | |✘|
|_. (?# … ) |comment|✘|✔| |✔| | |✘|
|_. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘|
|_. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘|
|_. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘|
|_. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘|
|_. (?flags: … ) |flags change|✘|✔| | | | |✔|
|_. (?flags) |flags change|✘|✔| |✔| | |✔|
|_. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔|
|_. (?<name> …) |named capturing group|✘|✔| |✘| | |✘|
|_. (?'name' …) |named capturing group|✘|✔| |✘| | |✘|
|_. (?&#x7c; …) |branch reset|✘|✔| |✘| | |✘|


{| class="infotable line"
h3. Supported syntax: flags
!
|_. Flag |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. /i |case insensitive|✔|✔| |✔| | |✔|
|_. /m |multi-line|✘|✔| |✔| | |✔|
|_. /s |dot matches anything|~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref> |✔| |✔| | |✔|
|_. /x |ignore whitespace and comments|✘|✔| |✔| | |✔|
|_. /U |minimal match|✔|✔| | | | |✔|
!
! QRegExp
! <span class="caps">PCRE</span>
! V8
! <span class="caps">ICU</span>
! Boost.Regex
! std::regex
! RE2
|-
! \a
| <span class="caps">BELL</span>
|
|
|
|
|
|
|
|-
! \A
| beginning of input
|
| ✔
|
| ✔
|
|
|
|-
! \b inside a [set]
| <span class="caps">BACKSPACE</span>
| ?
| ✔
|
|
|
|
|
|-
! \b outside a [set]
| on a word boundary
| ✔
| ✔
|
|
|
|
|
|-
! \B
| not on a word boundary
| ✔
| ✔
|
| ✔
|
|
|
|-
! \cX
| <span class="caps">ASCII</span> control character X
| ✘
| ✔
|
| ✔
|
|
|
|-
! \d
| digit
| ✔
| ✔
|
| ✔
|
|
|
|-
! \D
| non digit
| ✔
| ✔
|
| ✔
|
|
|
|-
! \e
| <span class="caps">ESCAPE</span>
|
| ✔
|
| ✔
|
|
|
|-
! \E
| end of \Q … \E quoting
| ✘
| ✔
|
| ✔
|
|
|
|-
! \f
| <span class="caps">FORM</span> <span class="caps">FEED</span>
| ✔
| ✔
|
| ✔
|
|
|
|-
! \G
| end of previous match
| ✘
| ✔
|
| ✔
|
|
|
|-
! \n
| <span class="caps">LINE</span> <span class="caps">FEED</span>
| ✔
| ✔
|
|
|
|
|
|-
! \N{x}
| <span class="caps">UNICODE</span> <span class="caps">CHARACTER</span> <span class="caps">NAME</span> x
|
| ✘
|
| ✔
|
|
|
|-
! \p{x}
| <span class="caps">UNICODE</span> <span class="caps">PROPERTY</span> <span class="caps">NAME</span> x
| ✘
| ✔
|
| ✔
|
|
|
|-
! \P{x}
| <span class="caps">UNICODE</span> <span class="caps">PROPERTY</span> <span class="caps">NAME</span> not x
| ✘
| ✔
|
| ✔
|
|
|
|-
! \Q
| start of \Q … \E quoting
| ✘
| ✔
|
| ✔
|
|
|
|-
! \r
| <span class="caps">CARRIAGE</span> <span class="caps">RETURN</span>
| ✔
| ✔
|
| ✔
|
|
|
|-
! \s
| white space
| ✔
| ✔
|
| ✔
|
|
|
|-
! \S
| non white space
| ✔
| ✔
|
| ✔
|
|
|
|-
! \t
| <span class="caps">HORIZONTAL</span> <span class="caps">TAB</span>
|
| ✔
|
| ✔
|
|
|
|-
! \uhhhh
| U+hhhh (between U+0000 and U+FFFF)
| ✘
| ✘
|
|
|
|
|
|-
! \Uhhhhhhhh
| U+hhhhhhhh (between U+00000000 and U+0010FFFF)
| ✘
| ✘
|
|
|
|
|
|-
! \v
| <span class="caps">VERTICAL</span> <span class="caps">TAB</span>
| ✔
| ✔
|
| ✘
|
|
|
|-
! \w
| word character
| ✔
| ✔
|
| ✔
|
|
|
|-
! \W
| non word character
| ✔
| ✔
|
| ✔
|
|
|
|-
! \x{hhhh}
| U+hhhh
| ✘
| ✔ (0-10FFFF)
|
| ✔ (0-10FFFF)
|
|
|
|-
! \xhhhh
| U+hhhh
| ✔ (0000-<span class="caps">FFFF</span>)
| ✔ (00-FF)
|
| ✔ (00-FF)
|
|
|
|-
! \X
| grapheme cluster
| ✘
| ✘
|
|
|
|
|
|-
! \Z
| end of input (or before the final \n)
| ✘
| ✔
|
| ✔
|
|
|
|-
! \z
| end of input
| ✘
| ✔
|
| ✔
|
|
|
|-
! \n
| n-th backreference
| ✔
| ✔
|
|
|
|
|
|-
! \0ooo
| <span class="caps">ASCII</span>/Latin-1 character 0ooo
|
| ✔
|
|
|
|
|
|-
! .
| any character but newlines
| ✔
| ✔
|
|
|
|
|
|-
! ^
| line beginning
| ✔
|
|
| ✔
|
|
|
|-
! $
| line end
| ✔
| ✔
|
| ✔
|
|
|
|-
! \
| quote the following symbol
| ✔
| ✔
|
| ✔
|
|
|
|-
! [pattern]
| set
| ✔
| ✔
|
| ✔
|
|
|
|}


===Supported syntax: Operators===
= Benchmarks =
 
{| class="infotable line"
! Operator
!
! QRegExp
! <span class="caps">PCRE</span>
! V8
! <span class="caps">ICU</span>
! Boost.Regex
! std::regex
! RE2
|-
! *
| match 0 or more times
| ✔
| ✔
|
| ✔
| ?
| ?
| ✔
|-
! +
| match 1 or more times
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! ?
| match 0 or 1 times
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! {n}
| match n times
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! {n,}
| match n or more times
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! {n,m}
| match between n and m times
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! *?
| match 0 or more times, not greedy
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! +?
| match 1 or more times, not greedy
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! ??
| match 0 or 1 times, not greedy
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! {n}?
| match n times
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! {n,}?
| match n or more times, not greedy
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! {n,m}?
| match between n and m times, not greedy
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! *+
| match 0 or more times, possessive
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! ++
| match 1 or more times, possessive
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! ?+
| match 0 or 1 times, possessive
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! {n}+
| match n times
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! {n,}+
| match n or more times, possessive
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! {n,m}+
| match between n and m times, possessive
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! ( … )
| capturing group
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! (?: … )
| group
| ✔
| ✔
|
| ✔
|
|
| ✔
|-
! (?&gt; … )
| atomic grouping
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! (?# … )
| comment
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! (?= … )
| look-ahead assertion
| ✔
| ✔
|
| ✔
|
|
| ✘
|-
! (?! … )
| negative look-ahead assertion
| ✔
| ✔
|
| ✔
|
|
| ✘
|-
! (?&lt;= … )
| look-behind assertion
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! (?&lt;! … )
| negative look-behind assertion
| ✘
| ✔
|
| ✔
|
|
| ✘
|-
! (?flags: … )
| flags change
| ✘
| ✔
|
|
|
|
| ✔
|-
! (?flags)
| flags change
| ✘
| ✔
|
| ✔
|
|
| ✔
|-
! (?P&lt;name&gt; …)
| named capturing group
| ✘
| ✔
|
| ✘
|
|
| ✔
|-
! (?&lt;name&gt; …)
| named capturing group
| ✘
| ✔
|
| ✘
|
|
| ✘
|-
! (?'name' …)
| named capturing group
| ✘
| ✔
|
| ✘
|
|
| ✘
|-
! (?| …)
| branch reset
| ✘
| ✔
|
| ✘
|
|
| ✘
|-
!
|
|
|
|
|
|
|
|
|-
!
|
|
|
|
|
|
|
|
|}
 
===Supported syntax: flags===
 
=Benchmarks=


See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.


<sup>1</sup> It’s actually not possible to <span class="caps">UNSET</span> /s for QRegExp, i.e. to make the dot not match a newline.
<references />
 
<sup>2</sup> Cf. http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-August/000962.html
 
===Categories:===
 
* [[:Category:Developing Qt|Developing_Qt]]
** [[:Category:Developing Qt::Qt-Planning|Qt Planning]]
*** [[:Category:Developing Qt::Qt-Planning::Qt-Public-Roadmap|Qt Public Roadmap]]

Revision as of 09:26, 24 February 2015



Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.nokia.com/browse/QTBUG-20888

Current issues with QRegExp

From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html

High level issues

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. This is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find the last match that indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
  • T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
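
To make T3 concrete, the following is a minimal sketch of what a lazy quantifier changes. It uses std::regex (one of the candidate engines listed on this page) rather than QRegExp, and the helper names firstMatch and demoLazyQuantifier are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `input`,
// or an empty string if there is no match. (Hypothetical helper.)
std::string firstMatch(const std::string &input, const std::string &pattern)
{
    const std::regex re(pattern);  // ECMAScript grammar is the default
    std::smatch m;
    if (!std::regex_search(input, m, re))
        return std::string();
    return m.str(0);
}

// Contrasts a greedy and a lazy quantifier on the same input.
void demoLazyQuantifier()
{
    const std::string text = "<b>one</b> and <b>two</b>";
    // Greedy: ".*" extends to the last "</b>" in the input.
    assert(firstMatch(text, "<b>.*</b>") == "<b>one</b> and <b>two</b>");
    // Lazy: ".*?" stops at the first "</b>" that lets the pattern match.
    assert(firstMatch(text, "<b>.*?</b>") == "<b>one</b>");
}
```

The laziness is chosen per quantifier inside the pattern, which is why it should not be confused with a minimal-matching switch that applies to the pattern as a whole.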

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features as std::regex
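
As an illustration of the 1:n relation behind T7, here is a hedged sketch of how std::regex, the baseline named in the list above, keeps the compiled pattern and the match results in separate objects. It is not a proposal for the Qt API; matchTwo and demoSharedPattern are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Searches two inputs with the same const pattern and returns both matched
// substrings (an empty string where nothing matched). (Hypothetical helper.)
std::pair<std::string, std::string> matchTwo(const std::regex &pattern,
                                             const std::string &first,
                                             const std::string &second)
{
    std::smatch a;  // result of the first search
    std::smatch b;  // result of the second search, independent of `a`
    std::regex_search(first, a, pattern);
    std::regex_search(second, b, pattern);  // leaves `a` and `pattern` untouched
    return std::make_pair(a.empty() ? std::string() : a.str(0),
                          b.empty() ? std::string() : b.str(0));
}

void demoSharedPattern()
{
    const std::regex digits("[0-9]+");  // one matcher, genuinely const
    const std::pair<std::string, std::string> r =
        matchTwo(digits, "Qt 5", "version 48");
    assert(r.first == "5");    // first result survives the second search
    assert(r.second == "48");
}
```

Because the per-search state lives in the match objects, searching never has to modify the const pattern, which is the property T7 asks for.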

Proposed libraries


Feature matrix

|_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. General comments|See above.| | | | | | |
|_. Already being used in Qt?| Yes | Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. | Yes (Qt 5) | libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. | No | No | No |
|_. Pros | |Widely used, de-facto standard implementation for Perl-like regexps.| |Uses UTF-16 natively. | | |Very fast, uses a DFA.|
|_. Cons | |Uses UTF-8, thus requiring string and index conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2]. | |Boost does not give guarantees about ABI compatibility.| |Uses UTF-8; supports neither lookbehind nor lookahead.|
|_. Fixes T1 | |✔| | ✔| | |✔ |
|_. Fixes T2 | |✔| | ✔| | |✔ |
|_. Fixes T3 | |✔| |✔| | |? |
|_. Fixes T4 | |✔| |✘| | | ✘|
|_. Fixes T5 | |?| |?| | |✘ |
|_. Fixes T6 | |✔ ("by hand", with partial matching)| |Maybe yes, see "UText":http://userguide.icu-project.org/strings/utext .| | |✔ see StringPiece |
|_. Fixes T7 | |✔| |✔| | | ✘|
✘✔



Supported syntax: Characters

|_. |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. \a |BELL|✔|✔| |✔| | | |
|_. \A |beginning of input| |✔| |✔| | | |
|_. \b inside a [set]|BACKSPACE|?|✔| | | | | |
|_. \b outside a [set]|on a word boundary|✔|✔| | | | | |
|_. \B |not on a word boundary|✔|✔| |✔| | | |
|_. \cX |ASCII control character X|✘|✔| |✔| | | |
|_. \d |digit|✔|✔| |✔| | | |
|_. \D |non digit|✔|✔| |✔| | | |
|_. \e |ESCAPE|✘|✔| |✔| | | |
|_. \E |end of \Q … \E quoting|✘|✔| |✔| | | |
|_. \f |FORM FEED|✔|✔| |✔| | | |
|_. \G |end of previous match|✘|✔| |✔| | | |
|_. \n |LINE FEED|✔|✔| | | | | |
|_. \N{x}|UNICODE CHARACTER NAME x|✘|✘| |✔| | | |
|_. \p{x}|UNICODE PROPERTY NAME x|✘|✔| |✔| | | |
|_. \P{x}|UNICODE PROPERTY NAME not x|✘|✔| |✔| | | |
|_. \Q |start of \Q … \E quoting|✘|✔| |✔| | | |
|_. \r |CARRIAGE RETURN|✔|✔| |✔| | | |
|_. \s |white space|✔|✔| |✔| | | |
|_. \S |non white space|✔|✔| |✔| | | |
|_. \t |HORIZONTAL TAB|✔|✔| |✔| | | |
|_. \uhhhh |U+hhhh (between U+0000 and U+FFFF)|✘|✘| | | | | |
|_. \Uhhhhhhhh |U+hhhhhhhh (between U+00000000 and U+0010FFFF)|✘|✘| | | | | |
|_. \v |VERTICAL TAB|✔|✔| |✘| | | |
|_. \w |word character|✔|✔| |✔| | | |
|_. \W |non word character|✔|✔| |✔| | | |
|_. \x{hhhh}|U+hhhh|✘|✔ (0-10FFFF)| |✔ (0-10FFFF)| | | |
|_. \xhhhh |U+hhhh|✔ (0000-FFFF)|✔ (00-FF)| |✔ (00-FF)| | | |
|_. \X |grapheme cluster|✘|✘| | | | | |
|_. \Z |end of input (or before the final \n)|✘|✔| |✔| | | |
|_. \z |end of input|✘|✔| |✔| | | |
|_. \n |n-th backreference|✔|✔| | | | | |
|_. \0ooo|ASCII/Latin-1 character 0ooo|✔|✔| | | | | |
|_. .|any character but newlines|✔|✔| |✔| | | |
|_. ^|line beginning|✔|✔| |✔| | | |
|_. $|line end|✔|✔| |✔| | | |
|_. \ |quote the following symbol|✔|✔| |✔| | | |
|_. [pattern]|set|✔|✔| |✔| | | |


Supported syntax: Operators

|_. Operator |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. * |match 0 or more times|✔|✔| |✔|?|?|✔|
|_. + |match 1 or more times|✔|✔| |✔| | |✔|
|_. ? |match 0 or 1 times|✔|✔| |✔| | |✔|
|_. {n} |match n times|✔|✔| |✔| | |✔|
|_. {n,} |match n or more times|✔|✔| |✔| | |✔|
|_. {n,m} |match between n and m times|✔|✔| |✔| | |✔|
|_. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔|
|_. +? |match 1 or more times, not greedy|✘|✔| |✔| | |✔|
|_. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔|
|_. {n}? |match n times|✘|✔| |✔| | |✔|
|_. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔|
|_. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔|
|_. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘|
|_. ++ |match 1 or more times, possessive|✘|✔| |✔| | |✘|
|_. ?+ |match 0 or 1 times, possessive|✘|✔| |✔| | |✘|
|_. {n}+ |match n times|✘|✔| |✔| | |✘|
|_. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘|
|_. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘|
|_. ( … ) |capturing group|✔|✔| |✔| | |✔|
|_. (?: … ) |group|✔|✔| |✔| | |✔|
|_. (?> … ) |atomic grouping|✘|✔| |✔| | |✘|
|_. (?# … ) |comment|✘|✔| |✔| | |✘|
|_. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘|
|_. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘|
|_. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘|
|_. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘|
|_. (?flags: … ) |flags change|✘|✔| | | | |✔|
|_. (?flags) |flags change|✘|✔| |✔| | |✔|
|_. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔|
|_. (?<name> …) |named capturing group|✘|✔| |✘| | |✘|
|_. (?'name' …) |named capturing group|✘|✔| |✘| | |✘|
|_. (?&#x7c; …) |branch reset|✘|✔| |✘| | |✘|


Supported syntax: flags

|_. Flag |_. |_. QRegExp |_. PCRE|_. V8|_. ICU|_. Boost.Regex|_. std::regex|_. RE2|
|_. /i |case insensitive|✔|✔| |✔| | |✔|
|_. /m |multi-line|✘|✔| |✔| | |✔|
|_. /s |dot matches anything|~[1] |✔| |✔| | |✔|
|_. /x |ignore whitespace and comments|✘|✔| |✔| | |✔|
|_. /U |minimal match|✔|✔| |✘| | |✔|
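
For readers unfamiliar with the flag letters, the sketch below shows the behaviour behind /i and /s using std::regex, whose column in the table is still empty. It is an assumption-laden illustration, not a statement about QRegExp or any other engine in the table: std::regex has no /s switch, and the helper names matchesWhole and demoFlags are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True if `pattern` matches the whole of `input` under the given syntax flags.
// (Hypothetical helper.)
bool matchesWhole(const std::string &input, const std::string &pattern,
                  std::regex::flag_type flags = std::regex::ECMAScript)
{
    const std::regex re(pattern, flags);
    return std::regex_match(input, re);
}

void demoFlags()
{
    // Counterpart of /i: case-insensitive matching is opt-in.
    assert(!matchesWhole("QREGEXP", "qregexp"));
    assert(matchesWhole("QREGEXP", "qregexp",
                        std::regex::ECMAScript | std::regex::icase));
    // No /s here: in the ECMAScript grammar the dot skips line terminators,
    // so a newline has to be matched explicitly.
    assert(!matchesWhole("a\nb", "a.b"));
    assert(matchesWhole("a\nb", "a(.|\\n)b"));
}
```

This is the opposite default from the one described in footnote 1, where the dot always matches a newline and the behaviour cannot be switched off.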

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. to make the dot not match a newline.