Regexp engine in Qt5: Difference between revisions

From Qt Wiki
Jump to navigation Jump to search
(Tables)
(Removed the cleanup tag as the page is now correct)
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{Cleanup | reason=Auto-imported from ExpressionEngine.}}
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]


Line 7: Line 5:


This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
* http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.nokia.com/browse/QTBUG-20888
* https://bugreports.qt.io/browse/QTBUG-20888


== Current issues with QRegExp ==
== Current issues with QRegExp ==


From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html


=== High level issues ===
=== High level issues ===
Line 19: Line 17:
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include
** Non-greedy quantifiers (see page 141 titled "- 129 -")
** Non-greedy quantifiers (see page 141 titled "- 129 -")
'''''' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
** But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
* Patternist/XPath also needs Regex features not found in QRegExp, including
* Patternist/XPath also needs Regex features not found in QRegExp, including
'''''' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] …
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] …
* Customer complained about QRegExp (though I don't see what's their exact problem):
* Customer complained about QRegExp (though I don't see what's their exact problem):
Line 37: Line 35:
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T4''': lookbehind is not supported (lookahead is)
* '''T4''': lookbehind is not supported (lookahead is)
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".'''" returns 3, not 0
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
''' '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.


Line 68: Line 66:
! RE2
! RE2
|-
|-
! General comments|See above.
! General comments
| See above.
|  
|  
|  
|  
Line 183: Line 182:
! RE2
! RE2
|-
|-
!  
! \a
|BELL
|BELL
|✔
|✔
Line 193: Line 192:
|  
|  
|-
|-
!  
! \A
|beginning of input
|beginning of input
|  
|  
Line 203: Line 202:
|  
|  
|-
|-
! inside a [set]
! \b inside a [set]
|BACKSPACE
|BACKSPACE
|?
|?
Line 213: Line 212:
|  
|  
|-
|-
! outside a [set]
! \b outside a [set]
|on a word boundary
|on a word boundary
|✔
|✔
Line 223: Line 222:
|  
|  
|-
|-
!  
! \B
|not on a word boundary
|not on a word boundary
|✔
|✔
Line 233: Line 232:
|  
|  
|-
|-
!  
! \cX
|ASCII control character X
|ASCII control character X
|✘
|✘
Line 243: Line 242:
|  
|  
|-
|-
!  
! \d
|digit
|digit
|✔
|✔
Line 253: Line 252:
|  
|  
|-
|-
!  
! \D
|non digit
|non digit
|✔
|✔
Line 263: Line 262:
|  
|  
|-
|-
!  
! \e
|ESCAPE
|ESCAPE
|✘
|✘
Line 273: Line 272:
|  
|  
|-
|-
!  
! \E
|end of … quoting
|end of \Q \E quoting
|✘
|✘
|✔
|✔
Line 283: Line 282:
|  
|  
|-
|-
!  
! \f
|FORM FEED
|FORM FEED
|✔
|✔
Line 293: Line 292:
|  
|  
|-
|-
!  
! \G
|end of previous match
|end of previous match
|✘
|✘
Line 303: Line 302:
|  
|  
|-
|-
!  
! \n
|LINE FEED
|LINE FEED
|✔
|✔
Line 313: Line 312:
|  
|  
|-
|-
! {x}
! \N{x}
|UNICODE CHARACTER NAME x
|UNICODE CHARACTER NAME x
|✘
|✘
Line 323: Line 322:
|  
|  
|-
|-
! {x}
! \p{x}
|UNICODE PROPERTY NAME x
|UNICODE PROPERTY NAME x
|✘
|✘
Line 333: Line 332:
|  
|  
|-
|-
! {x}
! \P{x}
|UNICODE PROPERTY NAME not x
|UNICODE PROPERTY NAME not x
|✘
|✘
Line 343: Line 342:
|  
|  
|-
|-
!  
! \Q
|start of … quoting
|start of \Q \E quoting
|✘
|✘
|✔
|✔
Line 353: Line 352:
|  
|  
|-
|-
!  
! \r
|CARRIAGE RETURN
|CARRIAGE RETURN
|✔
|✔
Line 363: Line 362:
|  
|  
|-
|-
!  
! \s
|white space
|white space
|✔
|✔
Line 373: Line 372:
|  
|  
|-
|-
!  
! \S
|non white space
|non white space
|✔
|✔
Line 383: Line 382:
|  
|  
|-
|-
!  
! \t
|HORIZONTAL TAB
|HORIZONTAL TAB
|✔
|✔
Line 393: Line 392:
|  
|  
|-
|-
!  
! \uhhhh
|U+hhhh (between U+0000 and U+FFFF)
|U+hhhh (between U+0000 and U+FFFF)
|✘
|✘
Line 403: Line 402:
|  
|  
|-
|-
!  
! \Uhhhhhhhh
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|✘
|✘
Line 413: Line 412:
|  
|  
|-
|-
! VERTICAL TAB
! \v
| VERTICAL TAB
|✔
|✔
|✔
|✔
Line 422: Line 422:
|  
|  
|-
|-
!  
! \w
|word character
|word character
|✔
|✔
Line 432: Line 432:
|  
|  
|-
|-
!  
! \W
|non word character
|non word character
|✔
|✔
Line 442: Line 442:
|  
|  
|-
|-
! {hhhh}
! \x{hhhh}
|U+hhhh
|U+hhhh
|✘
|✘
Line 452: Line 452:
|  
|  
|-
|-
!  
! \xhhhh
|U+hhhh
|U+hhhh
|✔ (0000-FFFF)
|✔ (0000-FFFF)
Line 462: Line 462:
|  
|  
|-
|-
!  
! \X
|grapheme cluster
|grapheme cluster
|✘
|✘
Line 472: Line 472:
|  
|  
|-
|-
!  
! \Z
|end of input (or before the final )
|end of input (or before the final )
|✘
|✘
Line 482: Line 482:
|  
|  
|-
|-
!  
! \z
|end of input
|end of input
|✘
|✘
Line 492: Line 492:
|  
|  
|-
|-
!  
! \n
|n-th backreference
|n-th backreference
|✔
|✔
Line 502: Line 502:
|  
|  
|-
|-
! ooo
! \0ooo
|ASCII/Latin-1 character 0ooo
|ASCII/Latin-1 character 0ooo
|✔
|✔
Line 542: Line 542:
|  
|  
|-
|-
! quote the following symbol
! \
|quote the following symbol
|✔
|✔
|✔
|✔
Line 561: Line 562:
|  
|  
|}
|}


=== Supported syntax: Operators ===
=== Supported syntax: Operators ===

Latest revision as of 15:41, 22 May 2015


Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")
    • But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
  • Patternist/XPath also needs Regex features not found in QRegExp, including
  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem
  • Customer complained about QRegExp (though I don't see what's their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
  • T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
General comments See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn't have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7

✘✔


Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
\a BELL
\A beginning of input
\b inside a [set] BACKSPACE ?
\b outside a [set] on a word boundary
\B not on a word boundary
\cX ASCII control character X
\d digit
\D non digit
\e ESCAPE
\E end of \Q … \E quoting
\f FORM FEED
\G end of previous match
\n LINE FEED
\N{x} UNICODE CHARACTER NAME x
\p{x} UNICODE PROPERTY NAME x
\P{x} UNICODE PROPERTY NAME not x
\Q start of \Q … \E quoting
\r CARRIAGE RETURN
\s white space
\S non white space
\t HORIZONTAL TAB
\uhhhh U+hhhh (between U+0000 and U+FFFF)
\Uhhhhhhhh U+hhhhhhhh (between U+00000000 and U+0010FFFF)
\v VERTICAL TAB
\w word character
\W non word character
\x{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
\xhhhh U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
\X grapheme cluster
\Z end of input (or before the final )
\z end of input
\n n-th backreference
\0ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
\ quote the following symbol
[pattern] set

Supported syntax: Operators

Operator QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
* match 0 or more times ? ?
+ match 1 or more times
? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
? match 1 or more times, not greedy
?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
*+ match 0 or more times, possessive
+ match 1 or more times, possessive
? match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?'name' …) named capturing group
…) branch reset


Supported syntax: flags

Flag QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
/i case insensitive
/m multi-line
/s dot matches anything ~[1]
/x ignore whitespace and comments
/U minimal match

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.