QtWebEngine/ScriptsAndExtensions

From Qt Wiki

This page is about the internals of script injection and extensions in Qt WebEngine and Chromium. Qt WebEngine currently does not provide any API for extensions, although internally some code exists for supporting the PDF viewer extension bundled with Chromium.

Introduction to Script Injection

Script injection is the process of executing JavaScript code in the context of a web page in addition to the standard <script> HTML tags. Often the injection of CSS stylesheets is also included under the rubric of script injection. Script injection is useful for modifying the behavior of pages which are not bundled with or generated by the application itself. In the world of open-source WebEngine-based applications, the browser galacteek, for example, uses script injection to support the use of the peer-to-peer protocol IPFS instead of HTTP. The e-book manager calibre uses it to implement its e-book reading UI. Another open-source browser webmacs injects code for its emacs-style text editing functionality. Conversely, script injection is not particularly useful for hybrid UIs, for example, since they can simply include the scripts from their HTML in the usual way.

To prevent the ordinary scripts on the page from accidentally or maliciously interfering with the injected code, the injected scripts are usually executed in an isolated world, while ordinary scripts are executed in the main world. Scripts in different worlds share the same DOM data structure, including the Document object and all the HTML elements, but cannot see or modify each other's JavaScript variables, functions, etc. Communication between worlds is possible only through message-passing via DOM events. Qt WebEngine does allow injection also into the main world if interfering with the ordinary scripts happens to be the goal.

In principle, script injection is the simple matter of reacting to DOM loading events from the browser engine by passing some strings with JavaScript code to the JavaScript engine. This is how script injection works in the APIs of Qt WebKit and Blink, for example. However, this "simple matter" is not so simple in a multi-process architecture, particularly because it allows for very little parallelism. For this reason, it is preferable to employ a more declarative API where rules describing which scripts to inject when and under what condition are set up beforehand. The injection logic, as described by these rules, can be communicated to each subprocess where it can then be executed independently, without having to constantly wait for the main process.

Two major types of use cases can be distinguished. On the one hand, applications can use script injection to implement a specific feature, such as emacs-style text editing. In this case, the use of injection is an internal implementation detail of the application that is invisible to the end-user. On the other hand, applications can provide extension mechanisms to allow end-users to modify the behavior of web pages via script injection. In this case, the application must provide a stable public interface for script injection; many such interfaces already exist, including Greasemonkey's user scripts, Stylish's user styles, and Chromium's extensions. This is, of course, only a distinction between use cases, and does not necessarily apply cleanly to the implementation, because the implementations of these extension mechanisms can be generally useful even if used only internally. Chromium, for example, uses extensions internally to support the PDF viewer, the media router, and other features. Likewise, Qt WebEngine implements some support for Chromium extensions in order to use the PDF viewer, without providing a public API for installing or managing extensions, etc.

Introduction to User Scripts, User Styles, and Extensions

User scripts, invented in the context of the Greasemonkey extension for Firefox, allow end-users to inject their own scripts into web sites in order to customize their behavior. These are JavaScript files that begin with a special comment block containing metadata for describing the script and controlling which web sites it should affect. User styles were also first created in a Firefox extension, this time called Stylish. These are CSS files, again with a special header, that can be used to customize a web site's appearance.
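The metadata block can be read with a few lines of code. The following is a minimal sketch of such a reader, assuming the usual Greasemonkey layout of `// @key value` lines between `// ==UserScript==` and `// ==/UserScript==` markers. The names `Metadata`, `trim` and `parseUserScriptHeader` are hypothetical; this is not the parser used by Qt WebEngine or by any particular browser, and it deliberately ignores details such as a leading byte order mark.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Key/value pairs in the order they appear; keys such as "match" may repeat.
using Metadata = std::vector<std::pair<std::string, std::string>>;

// Strips leading and trailing ASCII whitespace.
static std::string trim(const std::string &s) {
    const char *ws = " \t\r\n";
    const auto begin = s.find_first_not_of(ws);
    if (begin == std::string::npos)
        return "";
    const auto end = s.find_last_not_of(ws);
    return s.substr(begin, end - begin + 1);
}

// Collects the "// @key value" lines between the ==UserScript== markers.
// Note: a byte order mark in front of the first line is NOT handled here.
Metadata parseUserScriptHeader(const std::string &source) {
    Metadata result;
    std::istringstream in(source);
    std::string line;
    bool inBlock = false;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.rfind("//", 0) != 0)
            continue;                       // only line comments are relevant
        const std::string body = trim(line.substr(2));
        if (body == "==UserScript==") {
            inBlock = true;
            continue;
        }
        if (body == "==/UserScript==")
            break;                          // end of the metadata block
        if (!inBlock || body.empty() || body[0] != '@')
            continue;
        const auto split = body.find_first_of(" \t");
        const std::string key =
            body.substr(1, split == std::string::npos ? split : split - 1);
        const std::string value =
            split == std::string::npos ? "" : trim(body.substr(split));
        result.emplace_back(key, value);
    }
    return result;
}
```

Returning the pairs in source order, rather than a map, keeps repeated keys such as several `@match` lines intact.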

Chromium supports user scripts natively by automatically converting them into extensions when the user clicks on a link that ends with .user.js. However, there is also the extension Tampermonkey, which implements user scripts without this conversion, and provides a more complete feature set along with a specialized user interface. User styles are not natively supported by Chromium but can be used via extensions such as Stylish and Stylus.

Support for user scripts was added to Qt WebEngine in 2016. This is not a full implementation of the Greasemonkey spec, but only of the part that cannot be implemented in terms of the existing API, namely conditional injection using pattern matching rules. This feature is used by the open-source browsers qutebrowser and falkon to implement the full Greasemonkey spec. Both browsers use their own Greasemonkey metadata parsers to extract the metadata, then pass the same script along to Qt WebEngine for a second parsing.
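To give an idea of what conditional injection by pattern matching involves, here is a small sketch of glob-style matching in the spirit of Greasemonkey's `@include` rules, where `*` stands for any run of characters. The function name `globMatch` is hypothetical, and the sketch makes no claim to reproduce the exact rules applied by Qt WebEngine or Chromium (the stricter `@match` syntax, for example, is not modeled).

```cpp
#include <cstddef>
#include <string>

// Returns true if `text` matches `pattern`, where '*' matches any run of
// characters (including none) and every other character matches itself.
bool globMatch(const std::string &pattern, const std::string &text) {
    std::size_t p = 0, t = 0;
    std::size_t starP = std::string::npos;  // position of the last '*' seen
    std::size_t starT = 0;                  // text position paired with it
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            starP = p++;                    // first try matching zero characters
            starT = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;
            ++t;
        } else if (starP != std::string::npos) {
            p = starP + 1;                  // backtrack: the star absorbs one more
            t = ++starT;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*')
        ++p;                                // trailing stars match the empty rest
    return p == pattern.size();
}
```

An application would typically call such a function once per rule with the URL of the document about to be loaded, and inject the script only if at least one include rule matches.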

In general, user scripts are not a particularly popular feature, since they are mainly aimed at advanced users; their main advantage is that they are easy to create and are portable across browsers. Extensions, on the other hand, can have their own user interface elements and are overall a much more feature-rich mechanism for extending the browser's functionality.

Physically, extensions consist of a manifest file along with HTML, CSS, JavaScript, and image files. Logically, extensions consist mainly of a background script, content scripts, and various sorts of user interface elements. What follows is a brief overview of the parts of an extension.

Permissions

Extensions must declare a list of permission keywords in their manifest. These must be presented to the user for approval during installation. Additionally, there is a list of optional permissions, which can be requested programmatically.

The Background Script

At the core of an extension is the background script. This is the central brain of the extension that is responsible for reacting to events from the browser and other parts of the extension; events such as opening or closing tabs, navigations, downloads, clicks on the extension's icon, etc. The background script is executed in the context of the extension's background page, which is a lot like an ordinary web page, except that it is invisible and has special privileges. The background page can be lazy or persistent. A lazy background page is suspended as soon as possible and awakened only when events are fired that the background script has registered an interest in. A persistent background page is always kept around in memory and is intended to be used with the webRequest API for intercepting network traffic.

In manifest v3, the next version of Chromium extension architecture that is planned for 2020, the background script will instead run as a service worker. Unlike the lazy background page, which is only suspended between events and retains its JavaScript state, the background service worker will be killed after an event is handled and no state will be retained (except for the persistent storage API, of course). The option for persistent background pages will not be available for manifest v3 extensions, although it seems that manifest v2 extensions will continue to be supported.

Content Scripts

For injecting JavaScript and CSS into a page, an extension can contain content scripts. Each content script is injected into an isolated world reserved for the particular extension. Content scripts can be injected declaratively or programmatically. In the declarative case, the scripts are declared in the manifest file and must contain pattern matching rules for conditional injection. In the programmatic case, the extension can use the tabs API to inject arbitrary JavaScript or CSS into whichever tab it wants. If the user clicked on the extension's icon or otherwise interacted with it, then the extension automatically gains the activeTab permission which allows it to inject scripts into the current tab. Otherwise, the extension will need to declare a requirement on the tabs permission to inject scripts.

User Interface Elements

As for user interface elements, these consist of browser actions and page actions with popup HTML pages, context menus, omnibox keywords with optional icons, override pages that can replace the new tab page and others, keyboard shortcuts that can potentially be global, devtools extensions for adding new tabs to devtools, and an options page that can be a separate page or a popup in the chrome://extensions page. Browser and page actions refer to the extension icon displayed in Chromium; in manifest v3, these two will be merged into just one type of action. Actions have a lot of features, like rendering the icon with an HTML5 canvas, but most importantly they can open a popup page, which is an HTML file declared in the manifest and rendered in the extension process. The popup page can communicate with the background page via messages and storage API, likewise with the options page.

Incognito Mode

Concerning incognito mode, the extension can be spanned or split. In split mode, the extension will have a separate process for incognito windows, and pages in the two processes cannot communicate except through the storage API, nor access each other's tabs or other browser features. Conversely, in spanned mode, the extension will share its pages between incognito and ordinary windows.

Script Injection in the Bowels of Blink

Before descending into the depths of Chromium, it might be a good idea to recall how script injection worked in Qt WebKit. The matter is much simpler in a single-process architecture, and, since Blink is a fork of WebKit, the APIs are naturally similar.

To inject code into a page with WebKit, one only has to call QWebFrame::evaluateJavaScript on the main frame of a page, passing it a string with JavaScript code. Now, this injects the code only once and only for the main frame. If the goal is to inject code into every subframe, then one would have to recurse down the frame tree using childFrames and call evaluateJavaScript for each frame. If the goal is to inject the code into each frame of each newly loaded page, then one would have to take into account that new frames may be created during navigation, and new documents installed into existing frames. Thus, one would have to listen for the QWebPage::frameCreated signal to catch new frames. Then, for each frame, one would have to listen for a signal such as QWebFrame::javaScriptWindowObjectCleared or loadFinished to inject the code into each new document.

As said, script injection with Blink works quite similarly to the Qt WebKit API. The major difference is that the frame tree lives in a subprocess, or it might even be distributed across a number of different subprocesses. To inject code into a page with Blink, one can call the ExecuteScript or ExecuteScriptInIsolatedWorld methods of a blink::WebLocalFrame in a subprocess, passing once again a string with JavaScript code. Monitoring the creation of new frames has to be done in the main process by implementing a content::WebContentsObserver and overriding the RenderFrameCreated method. Monitoring the frame itself has to be done in the corresponding subprocess by implementing a content::RenderFrameObserver and overriding methods like DidFinishDocumentLoad to re-inject code for new documents.

The ExecuteScript* methods themselves simply pass the JavaScript string to V8. The main complication here is the matter of isolated worlds, which should not be confused with the concept of a V8 isolate; these are entirely different things. An isolate in V8 (v8::Isolate) is essentially one instance of the V8 virtual machine, complete with its own garbage collected heap and its own stack. Objects from one isolate cannot interact in any way with objects from another isolate. A context in V8 (v8::Context) is what allows unrelated JavaScript programs to run inside a single isolate. Each context has a designated global object, which stores the global variables of the script and serves as the outermost scope for identifier lookup, providing access to builtin functions through its prototype chain. Objects from different contexts can, by default, interact with one another; however, if a security token is set on the context, then the tokens must match for the interaction to be allowed.

Blink uses V8 contexts to implement isolated worlds, but also to separate JavaScript of different frames, since JavaScript of different frames is still executed all within one isolate (held in a global variable in blink::V8PerIsolateData). Each frame has its own DOM tree, including its own DOM Window object (blink::DOMWindow, blink::LocalDOMWindow); however, the DOM objects are wrapped in proxy objects (blink::ScriptWrappable) and each proxy object of one DOM object lives in a different world. The DOM objects themselves each contain a pointer to the main world proxy (blink::ScriptWrappable::main_world_wrapper_), while the map from DOM objects to isolated world proxies is maintained separately (blink::DOMWrapperWorld). Each world, main or isolated, gets its own V8 context (blink::LocalWindowProxy::CreateContext). The global object of the V8 context of a world is a proxy object (blink::WindowProxy, blink::LocalWindowProxy) to the one DOM Window object of the frame. The DOM Window object is also called an execution context (blink::ExecutionContext is a supertype of blink::LocalDOMWindow), meaning that each frame has one execution context, which corresponds to multiple V8 contexts. The set of window proxies is maintained by blink::WindowProxyManager, which is contained by blink::ScriptController, which is contained by blink::LocalFrame, which is contained by blink::WebLocalFrame, which is contained by content::RenderFrame… Because the V8 contexts of each isolated world have separate global objects, scripts in different worlds cannot see each other's global variables or functions. Because the V8 contexts share the isolate, the security origin, and the execution context, scripts can still pass objects to one another through DOM events.
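The split between an inline main world wrapper and a separate per-world map can be pictured with a toy model. The sketch below is only an illustration of that bookkeeping idea: `Wrapper`, `DomObject` and `World` are hypothetical stand-ins, not Blink's classes, and nothing here involves V8.

```cpp
#include <map>
#include <memory>

// Stands in for the JavaScript-side proxy of a DOM object in one world.
struct Wrapper {
    int worldId;
};

// Stands in for a DOM object; the main world wrapper is stored inline so that
// the common case needs no map lookup.
struct DomObject {
    std::unique_ptr<Wrapper> mainWorldWrapper;
};

// Stands in for one world; world 0 is taken to be the main world (assumption).
class World {
public:
    explicit World(int id) : id_(id) {}

    bool isMainWorld() const { return id_ == 0; }

    // Returns this world's wrapper for `object`, creating it on first use.
    Wrapper *wrapperFor(DomObject &object) {
        if (isMainWorld()) {
            if (!object.mainWorldWrapper)
                object.mainWorldWrapper.reset(new Wrapper{id_});
            return object.mainWorldWrapper.get();
        }
        std::unique_ptr<Wrapper> &slot = isolatedWrappers_[&object];
        if (!slot)
            slot.reset(new Wrapper{id_});
        return slot.get();
    }

private:
    int id_;
    // Isolated worlds keep their wrappers in a side table keyed by DOM object.
    std::map<const DomObject *, std::unique_ptr<Wrapper>> isolatedWrappers_;
};
```

In this model, one DOM object seen from two worlds yields two distinct wrappers, while repeated lookups within one world always return the same wrapper.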

Aside from the methods ExecuteScript and ExecuteScriptInIsolatedWorld of blink::WebLocalFrame, there are also the asynchronous variants, RequestExecute*, which use blink::PausableScriptExecutor to execute scripts asynchronously while optionally delaying the window.onload event until after their execution has finished. This is supposed to reduce UI jank.

Script Injection in Qt WebEngine

Qt WebEngine essentially retains the evaluateJavaScript method from Qt WebKit, calling it QWebEnginePage::runJavaScript instead, however it is significantly less useful in practice due to two limitations. First, since the frame tree is no longer exposed in WebEngine, runJavaScript can only inject code for the main frame of a page. There is no possibility in WebEngine to inject code into the subframes of an already loaded page. Second, since navigation is carried out asynchronously in a subprocess, runJavaScript provides no ordering guarantees relative to any navigation or loading events. If the application calls runJavaScript at the wrong time during loading, the script might simply not be executed, or it might be executed before the window object is cleared, meaning its results, modifications, etc, are effectively discarded. Both of these limitations can be seen as arising from the multi-process architecture underlying WebEngine, although, by the example of Chromium's programmatic injection API, it can also be seen that these limitations are not absolutely necessary and can be overcome.

In principle, the same style of API as in Qt WebKit could also be implemented for Qt WebEngine; however, this would come with significant consequences for performance. If we view the interaction between the application and Qt WebKit as communication between two independent logical processes, then we can see that their communication protocol has the form of an interactive back-and-forth dialogue. WebKit notifies the application when events such as javaScriptWindowObjectCleared occur and waits for the application to respond. The application sends JavaScript code to WebKit via evaluateJavaScript and waits for WebKit to respond. The higher the communication overhead, the less efficient this protocol becomes, since more and more time will be spent simply on waiting for the other party to reply. This is no problem for WebKit, since the overhead of a function call is as low as it gets, but the situation changes as soon as the two logical processes are really put into separate OS processes where the "messages" are no longer function calls but actual IPC messages.

Now, there is a general solution to this sort of problem. Instead of a flurry of little back-and-forth messages, where the application is notified of events and sends back commands, the application must declare upfront what it wants to happen in reaction to which event; then no communication is needed, and therefore there is also no problem with communication overhead. This is the difference, for example, between client-side and server-side filtering in a database query: in client-side filtering the database sends each row to the application, which then implements the filtering predicate in its native programming language, while in server-side filtering the application encodes the filtering predicate in some domain-specific language and hands it off to the server, which can then send back only the matching rows. Likewise, with script injection, we can employ a declarative approach where the application decides upfront which scripts should be injected in reaction to which loading events, encodes this in some data structure or other, and hands it off to WebEngine. This allows the main process to distribute the serialized injection logic ahead of time to all its subprocesses and each subprocess can then carry out the injections independently, that is, without waiting on the main process.
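The declarative idea can be sketched as plain data plus a lookup function. In the sketch below, the rule table is built once by the application, and the subprocess consults it locally whenever a loading event fires. The names `LoadEvent`, `InjectionRule` and `rulesFor` are hypothetical and chosen for illustration only; the event names merely echo the injection points of QWebEngineScript.

```cpp
#include <string>
#include <vector>

// Loading events a rule can be attached to (illustrative names).
enum class LoadEvent { DocumentCreation, DocumentReady, Deferred };

// One declarative rule: what to inject, when, where, and into which world.
struct InjectionRule {
    LoadEvent when;       // loading event that triggers the injection
    bool subFrames;       // also inject into subframes, not only the main frame
    int worldId;          // 0 is taken to mean the main world (assumption)
    std::string source;   // JavaScript source code to run
};

// Evaluated inside the subprocess: selects the rules that apply to a loading
// event in a given frame, without asking the main process.
std::vector<const InjectionRule *> rulesFor(const std::vector<InjectionRule> &rules,
                                            LoadEvent event, bool isMainFrame) {
    std::vector<const InjectionRule *> matching;
    for (const InjectionRule &rule : rules) {
        if (rule.when == event && (isMainFrame || rule.subFrames))
            matching.push_back(&rule);
    }
    return matching;
}
```

Because the rules are ordinary data, they can be serialized and sent to each subprocess ahead of time; the only per-event work left is the local call to `rulesFor`.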

Naturally, the most flexible way to encode logic as data is to use a Turing-complete scripting language. Perhaps the application could upload JavaScript code to WebEngine; JavaScript code that then injects other JavaScript code into pages. Overkill to be sure, but it is worth noting that this is actually something that Chromium extensions can do, that is, they can inject JavaScript code from JavaScript code via Chromium's programmatic script injection API. The actual approach taken in WebEngine is much simpler. To inject code into a page with WebEngine, the code must be wrapped in a script object (QWebEngineScript or QML WebEngineScript) and the appropriate attributes, such as the injection point and isolated world identifier, must be set. The script object is then handed over to the profile or the page object, whereafter WebEngine will automatically perform the injection at the right time and place after the next navigation of the page or pages.

The API was implemented incrementally. In 2015, QWebEngineScript and QWebEngineScriptCollection were added to the Widgets-based API. In a follow-up patch, QQuickWebEngineScript was added to the Quick-based API. Fun fact, in a last minute API review cleanup, the script collection getter in QWebEngineProfile was changed from returning a reference to returning a pointer to be more idiomatic Qt. The getter in QWebEnginePage was forgotten, which is how we now have inconsistent return types for the two getters. In 2016, support was added for Greasemonkey pattern rules in metadata comments of the script source code. In early 2017, profile-wide scripts were added to the Quick-based API, bringing it to feature parity with the Widgets-based API. At the same time in 2017, QQuickWebEngineScript was made public to allow scripts to be managed from C++ in a Qt Quick application.

Under the hood, the injection process looks quite similar to WebKit, with the addition of a bunch of coordinating code for IPC. There are two main implementation classes, UserResourceControllerHost in the main process, one for each profile, and one UserResourceController in each subprocess (only one is needed since subprocesses cannot be shared between different profiles). The host class maintains a set of scripts, which includes all of the scripts added to the profile and also the scripts added to each page. The client class maintains a set of scripts, which likewise includes all of the scripts of the profile and each page, but only for pages that have frames in its subprocess. The host class is responsible for synchronizing these sets by monitoring the launching of new subprocesses in a profile and the creation of new frames in each subprocess. The client class is responsible for actually injecting the scripts into the correct frames using the Blink injection API. There are, of course, some complications with injection. For example, the DidClearWindowObject event cannot be used to trigger injection because it is disabled if JavaScript is disabled in Blink, but we would like injection to work regardless and only disable the standard scripts of the page. Instead, we have to post a task from DidCommitProvisionalLoad, which should end up being executed at exactly the same time as DidClearWindowObject, unless, of course, the underlying Chromium implementation changes…

As mentioned, the runJavaScript method of Qt WebEngine has two important limitations in comparison to the evaluateJavaScript of Qt WebKit. Both of these are overcome by the "declarative" API of Qt WebEngine. The script objects can be specified to run on either the main frame only, or on all subframes too. Likewise, since the script objects have a specified injection point, there is a guaranteed ordering in relation to loading events. However, the new API comes with its own limitations. First, the script objects are used for injection only after the next navigation, meaning that it is still not possible to inject code into all subframes of an existing page. Second, the scripts are executed in an undefined, essentially pseudo-random, order. We therefore lose guaranteed ordering relative to other scripts. If one script has a dependency on another, the only way to ensure that they are executed in the correct order is to turn them into one script by concatenating their source code. In that case, however, the possibility of specifying separate pattern matching rules is lost. (Separate isolated worlds also cannot be specified, but that is not too likely to be useful.) Third, there is no way to get back the result from JavaScript execution. If this is needed, then the runJavaScript API must be used, but since this does not provide guaranteed ordering relative to loading events, the call to runJavaScript must be made at the right time by waiting for the loading events of the page, which are not always entirely reliable, or the call must be executed again on failure.
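The concatenation workaround for ordering is simple enough to sketch. The helper below, with the hypothetical name `concatenateScripts`, joins the sources in dependency order and places a newline and a semicolon between them, so that a trailing line comment or a missing semicolon in one script cannot swallow the start of the next. The scripts are intentionally not wrapped in functions, on the assumption that later scripts rely on globals defined by earlier ones.

```cpp
#include <string>
#include <vector>

// Joins scripts, given in dependency order, into one source text. The newline
// ends any trailing "//" comment and the ';' ends any unterminated statement.
std::string concatenateScripts(const std::vector<std::string> &sources) {
    std::string combined;
    for (const std::string &source : sources) {
        combined += source;
        combined += "\n;\n";
    }
    return combined;
}
```

The combined text would then be installed as a single script object, which is exactly where the separate pattern matching rules of the original scripts are lost.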

In contrast, Chromium provides a programmatic injection API similar to runJavaScript, which overcomes its two limitations directly. First, Chromium's executeScript allows scripts to be injected into either the main frame only or main frame plus subframes (allFrames parameter), or even into a specific frame (frameId parameter). Second, Chromium's executeScript takes a runAt parameter to indicate the injection point; the script is guaranteed to not be injected before the specified point, but may be injected later, in case the call to executeScript is made after the document has already been loaded. Even if the script is not injected immediately, but is waiting for the injection point, Chromium can still return the result of the execution via callback. Similar extensions could, in principle, be made to runJavaScript, although, as described in the next section, the implementation is non-trivial. Of course, this would still not result in a runJavaScript that is as powerful as evaluateJavaScript, since even if we provided an ordering guarantee between loading events and the actual injection, we still would not have a definite order between loading events and the call to runJavaScript that sets up the injection; the only way to do that would be to have a fully synchronous API like Qt WebKit together with its performance implications for the multi-process situation. Thus, even with an extended runJavaScript, we still would need the declarative API as well.

If we had such an extended runJavaScript, plus the existing declarative API, then seemingly the only remaining fundamental limitation would be the lack of a definite injection order in the declarative API. Such an order could, however, be guaranteed relatively easily by changing the semantics of QWebEngineScriptCollection from a set of scripts to a list of scripts so that the application can define its own order for injection. However, some limitations still remain, which are not fundamental but only inconveniences. First, although we support pattern matching rules in the declarative API, these can only be specified as comments in a special metadata block of the JavaScript code. It might be nice to have the possibility to specify these directly on the QWebEngineScript object as ordinary properties. Second, there is the problem of double parsing the metadata block. Since a full implementation of the Greasemonkey spec requires the application to implement its own metadata parser, it is now possible for the parsers to disagree. (This has actually happened in qutebrowser, where it resulted in a difficult to track down bug due to the different handling of the UTF-8 byte order mark in the two parsers.) One option would be to parse and expose the full metadata, but if we do not actually implement the underlying behavior required by the spec, then this might be misleading to users of the API. However, if we do implement the possibility to specify pattern matching rules as attributes of the script object, then we could also allow the option of disabling the builtin metadata parser, which would result in only one parser as the canonical source of metadata. Third and last, there is no direct support for injecting CSS stylesheets. Although CSS can be injected by injecting JavaScript code that adds a <style> node to the DOM, it might be more convenient to have direct support for this in QWebEngineScript and runJavaScript (runStyleSheet?).
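The CSS workaround mentioned last amounts to generating a small piece of JavaScript from the stylesheet text. The following is a sketch of such a generator; `toJsStringLiteral` and `makeStyleInjector` are hypothetical helper names, and the generated script assumes it runs once the document has at least a root element.

```cpp
#include <string>

// Escapes `text` so that it can be placed inside a double-quoted JavaScript
// string literal. Only the characters that would break the literal are handled.
std::string toJsStringLiteral(const std::string &text) {
    std::string out = "\"";
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        default:   out += c;      break;
        }
    }
    out += '"';
    return out;
}

// Builds JavaScript that appends a <style> element containing `css`.
// The script falls back to the root element if <head> does not exist yet.
std::string makeStyleInjector(const std::string &css) {
    return "(function() {\n"
           "    var style = document.createElement('style');\n"
           "    style.textContent = " + toJsStringLiteral(css) + ";\n"
           "    (document.head || document.documentElement).appendChild(style);\n"
           "})();\n";
}
```

The resulting string can be used as the source code of an ordinary injected script; wrapping it in a function keeps the temporary `style` variable out of the global scope of the world it runs in.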

Script Injection in Chromium's Extensions Module

Script injection forms only a small part of Chromium's implementation of extensions; however, this small part is certainly large enough to deserve its own section.

In UserScript, we have the rough equivalent of a QWebEngineScript, but only a rough equivalent since even though the documentation comment for this class claims that it represents a standalone user script or the content script of an extension, it nonetheless contains not one script, but a list of scripts. In fact, it contains two lists of scripts: JavaScript files and CSS files. Additionally, it contains a number of properties, the serialization methods Pickle and Unpickle, plus a seemingly random assortment of helper functions. Note that properties such as the injection point and the pattern matching rules are defined per UserScript, not per JavaScript or CSS file. The files are represented as UserScript::File objects, which contain a path to the file and either an std::string with the file contents or a base::StringPiece that refers back to the file contents in a shared-memory region that is used for IPC between the browser and renderers. Each script has a unique ID, which as such must be generated in the browser process.

Scripts are loaded in the browser process by objects of the class UserScriptLoader. Each loader owns a set of UserScript objects and triggers loading whenever scripts are added or removed from this set. The actual loading is delegated to subclasses, either ExtensionUserScriptLoader or WebUIUserScriptLoader. These subclasses go over all of the UserScript::File objects in each UserScript, load the script contents for the file path, and store the contents inside the UserScript::File object as an std::string. They then serialize the whole set of UserScript objects and copy the result into a newly created shared memory region. Then UserScriptLoader takes over again and broadcasts a handle to the shared-memory region to each renderer process. The corresponding object on the renderer is the UserScriptSetManager.
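The serialize-then-share step can be illustrated with a flat byte buffer standing in for the shared-memory region. The sketch below uses a simple length-prefixed encoding; it is not Chromium's Pickle format, and all names in it (`Buffer`, `writeString`, `readString`, `serializeScripts`, `deserializeScripts`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stands in for the shared-memory region handed to the renderers.
using Buffer = std::vector<unsigned char>;

// Appends a 32-bit length prefix followed by the raw bytes of `text`.
void writeString(Buffer &out, const std::string &text) {
    const std::uint32_t size = static_cast<std::uint32_t>(text.size());
    unsigned char prefix[sizeof size];
    std::memcpy(prefix, &size, sizeof size);
    out.insert(out.end(), prefix, prefix + sizeof size);
    out.insert(out.end(), text.begin(), text.end());
}

// Reads one length-prefixed string at `offset`, advancing it on success.
// Returns false if the buffer is too short for the prefix or the payload.
bool readString(const Buffer &in, std::size_t &offset, std::string &text) {
    std::uint32_t size = 0;
    if (offset > in.size() || in.size() - offset < sizeof size)
        return false;
    std::memcpy(&size, in.data() + offset, sizeof size);
    offset += sizeof size;
    if (in.size() - offset < size)
        return false;
    text.assign(reinterpret_cast<const char *>(in.data() + offset), size);
    offset += size;
    return true;
}

// Browser side: flattens all script contents into one buffer.
Buffer serializeScripts(const std::vector<std::string> &sources) {
    Buffer out;
    for (const std::string &source : sources)
        writeString(out, source);
    return out;
}

// Renderer side: rebuilds the list of script contents from the buffer.
bool deserializeScripts(const Buffer &in, std::vector<std::string> &sources) {
    sources.clear();
    std::size_t offset = 0;
    while (offset < in.size()) {
        std::string source;
        if (!readString(in, offset, source))
            return false;
        sources.push_back(source);
    }
    return true;
}
```

Unlike this sketch, which copies every string on the reading side, the text above notes that the renderer-side file objects can refer back into the shared region instead of copying.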

The ExtensionUserScriptLoader subclass is used for loading script files of extensions. If the extension is a component extension, then files are loaded from the resource bundle, otherwise from inside the profile directory where the files of an installed extension are stored. Conversely, the WebUIUserScriptLoader subclass loads script files over the network service, which is apparently somehow used when a WebView is embedded within a WebUI to load the embedder's content scripts.

The UserScriptSetManager on the renderer side receives the shared memory region from the UserScriptLoader and hands it off to a UserScriptSet. The manager has one UserScriptSet for all scripts that have been statically defined in extension manifests, and separate sets for the programmatically-defined scripts of each extension. The UserScriptSet maintains the deserialized UserScript objects on the renderer side and creates ScriptInjection objects to perform injections.

TODO

  • ScriptInjection
  • ScriptInjector
      * ProgrammaticScriptInjector
      * UserScriptInjector
  • ScriptInjectionManager
  • ScriptContext
  • ScriptContextSet
  • DeclarativeUserScriptMaster
  • SharedUserScriptMaster
  • UserScriptListener

Implementation of Chromium Extensions

TODO