Regexp engine in Qt5: Difference between revisions

From Qt Wiki
Revision as of 15:33, 22 May 2015



Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • The QRegExp API is broken (see T7 under Low Level issues).
  • QRegExp is used for QtScript although it does not fulfill the ECMAScript specification ECMA-262-1999. Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")
    • However, the current implementation of QtScript uses JSC, which has its own engine anyway, and uses QRegExp only in its API as a container.
  • Patternist/XPath also needs regex features not found in QRegExp, including
  • Qt Creator might want to offer multi-line regex search and replace later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem.
  • A customer complained about QRegExp (though I don't see what their exact problem is):
    • In their code they have a regexp for matching emoticons. Unfortunately, they cannot use QRegExp because of its poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. This is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find the last match that indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
  • T6: only linear input is supported; for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence, matching with a const QRegExp instance modifies a const object.
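
The 1:n relation from T7 and the backward-search surprise from T5 can be sketched with std::regex, one of the candidates compared on this page, used here only as a stand-in. The helper names firstMatch, forwardIndex and backwardScanIndex are hypothetical, and the backward scan merely imitates the behaviour described in T5; it is an assumption, not QRegExp's actual implementation.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// T7: the compiled pattern stays const; every search fills its own match
// object, so one pattern can serve any number of matches.
std::string firstMatch(const std::regex& re, const std::string& subject) {
    std::smatch m;  // per-call result, `re` itself is not modified
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}

// T5, forward direction: start index of the first match, or -1.
std::ptrdiff_t forwardIndex(const std::regex& re, const std::string& subject) {
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.position(0) : -1;
}

// T5, backward direction (assumed behaviour): try start offsets from the
// last character down to 0 and report the first offset at which the
// pattern matches right at that offset.
std::ptrdiff_t backwardScanIndex(const std::regex& re, const std::string& subject) {
    for (std::size_t off = subject.size(); off-- > 0;) {
        std::smatch m;
        const auto start = subject.begin() + static_cast<std::ptrdiff_t>(off);
        if (std::regex_search(start, subject.end(), m, re,
                              std::regex_constants::match_continuous))
            return static_cast<std::ptrdiff_t>(off);
    }
    return -1;
}
```

With the pattern .* and the subject abcd, forwardIndex yields 0 while backwardScanIndex yields 3, which is the discrepancy T5 describes. Because the match result lives in its own object, the same const pattern can be shared freely between callers.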

Future

  • It must be a solid 3rd-party engine; we don't want to develop and maintain an in-house engine.
  • QRegExp is likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features as std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
General comments See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. Very fast, uses a DFA.
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. Uses UTF-8; supports neither lookbehind nor lookahead.
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7



Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
\a BELL
\A beginning of input
\b inside a [set] BACKSPACE ?
\b outside a [set] on a word boundary
\B not on a word boundary
\cX ASCII control character X
\d digit
\D non digit
\e ESCAPE
\E end of \Q … \E quoting
\f FORM FEED
\G end of previous match
\n LINE FEED
\N{x} UNICODE CHARACTER NAME x
\p{x} UNICODE PROPERTY NAME x
\P{x} UNICODE PROPERTY NAME not x
\Q start of \Q … \E quoting
\r CARRIAGE RETURN
\s white space
\S non white space
\t HORIZONTAL TAB
\uhhhh U+hhhh (between U+0000 and U+FFFF)
\Uhhhhhhhh U+hhhhhhhh (between U+00000000 and U+0010FFFF)
\v VERTICAL TAB
\w word character
\W non word character
\x{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
\xhhhh U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
\X grapheme cluster
\Z end of input (or before the final newline)
\z end of input
\n n-th backreference
\0ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
\ quote the following symbol
[pattern] set
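
A few of the escapes listed above can be seen working together in a small sketch. It uses std::regex with its default ECMAScript grammar purely as an illustration; the function name firstRepeatedWord and the sample pattern are hypothetical and not taken from any of the engines' documentation.

```cpp
#include <regex>
#include <string>

// Returns the first word that is immediately repeated after a single space,
// or an empty string if there is none.
std::string firstRepeatedWord(const std::string& text) {
    // \b = word boundary, \w = word character, \1 = first backreference
    static const std::regex re(R"(\b(\w+) \1\b)");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(1) : std::string();
}
```

For the input "this is is a test" the function returns "is": the leading \b prevents the match from starting inside "this", and the trailing \b prevents the backreference from matching only a prefix of a longer word.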

Supported syntax: Operators

Operator QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
* match 0 or more times ? ?
+ match 1 or more times
? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
+? match 1 or more times, not greedy
?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
*+ match 0 or more times, possessive
++ match 1 or more times, possessive
?+ match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?'name' …) named capturing group
(?| …) branch reset
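
The difference between greedy and non-greedy quantifiers, and the use of a look-ahead assertion, can be sketched as follows. std::regex with its default ECMAScript grammar is used as a stand-in; the helper names matchOf, countLoneSmileys and compiles are hypothetical, and the emoticon pattern is a made-up example rather than anything from the customer report mentioned earlier.

```cpp
#include <regex>
#include <string>

// First match of `pattern` in `subject`, or an empty string.
std::string matchOf(const std::string& pattern, const std::string& subject) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(subject, m, re) ? m.str(0) : std::string();
}

// Counts ":)" smileys that are not followed by another ")".
// The pattern uses a negative look-ahead: (?! ... ).
int countLoneSmileys(const std::string& text) {
    static const std::regex re(R"re(:\)(?!\)))re");
    int count = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        ++count;
    return count;
}

// True if this std::regex implementation accepts the pattern.
bool compiles(const std::string& pattern) {
    try {
        const std::regex re(pattern);
        (void)re;  // only construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

On "<a><b>" the greedy pattern <.*> consumes the whole string, whereas <.*?> stops at the first closing bracket and yields "<a>". Look-behind such as (?<=a)b is not part of the ECMAScript grammar used by std::regex, so compiles is expected to return false for it.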


Supported syntax: flags

Flag QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
/i case insensitive
/m multi-line
/s dot matches anything ~[1]
/x ignore whitespace and comments
/U minimal match
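
How such flags surface in an API differs per engine. As one hedged illustration, std::regex exposes case insensitivity as a syntax option, while its default ECMAScript grammar has no switch to make the dot match a newline. The helper names below are hypothetical.

```cpp
#include <regex>
#include <string>

// /i: case-insensitive matching via the icase syntax option.
bool matchesIgnoringCase(const std::string& pattern, const std::string& subject) {
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(subject, re);
}

// /s: in the ECMAScript grammar the dot does not match a line terminator,
// so a character class such as [\s\S] is a common way to match anything.
bool matchesWholeString(const std::string& pattern, const std::string& subject) {
    return std::regex_match(subject, std::regex(pattern));
}
```

Here matchesWholeString("a.b", "a\nb") is false while matchesWholeString("a[\\s\\S]b", "a\nb") is true, which is the opposite default to the QRegExp behaviour described in the footnote.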

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to unset /s for QRegExp, i.e. to make the dot not match a newline.