Regexp engine in Qt5

From Qt Wiki
Jump to: navigation, search
This article may require cleanup to meet the Qt Wiki's quality standards. Reason: Auto-imported from ExpressionEngine.
Please improve this article if you can. Remove the {{cleanup}} tag and add this page to Updated pages list after it's clean.


Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")

' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.

  • Patternist/XPath also needs Regex features not found in QRegExp, including

' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )

  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem
  • Customer complained about QRegExp (though I don't see what's their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern "." returns 3, not 0

T6: only linear input is supported, for a text editor like Kate this does not scale

  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn't have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7

✘✔


Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
BELL
beginning of input
inside a [set] BACKSPACE ?
outside a [set] on a word boundary
not on a word boundary
ASCII control character X
digit
non digit
ESCAPE
end of … quoting
FORM FEED
end of previous match
LINE FEED
{x} UNICODE CHARACTER NAME x
{x} UNICODE PROPERTY NAME x
{x} UNICODE PROPERTY NAME not x
start of … quoting
CARRIAGE RETURN
white space
non white space
HORIZONTAL TAB
U+hhhh (between U+0000 and U+FFFF)
U+hhhhhhhh (between U+00000000 and U+0010FFFF)
VERTICAL TAB
word character
non word character
{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
grapheme cluster
end of input (or before the final )
end of input
n-th backreference
ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
quote the following symbol
[pattern] set


Supported syntax: Operators

Operator QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
* match 0 or more times ? ?
+ match 1 or more times
 ? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
? match 1 or more times, not greedy
 ?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
*+ match 0 or more times, possessive
+ match 1 or more times, possessive
 ? match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?'name' …) named capturing group
…) branch reset


Supported syntax: flags

Flag QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
/i case insensitive
/m multi-line
/s dot matches anything ~[1]
/x ignore whitespace and comments
/U minimal match

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.