Qt test system and Flakiness information

== Qt test system ==


Test information for the Qt Project is collected on 3 levels:


[[File:Qt test system.png|center|thumb|1037x1037px|Simplified diagram explaining the 3-level Qt test system]]


=== 1) Qt test library LEVEL ===
'''''basic functionality'''''


[https://code.qt.io/cgit/qt/qtbase.git/tree/src/testlib Qt test library] provides the fundamental [https://doc.qt.io/qt-6/qttest-index.html testing functionality]. Automatic tests need to include the test library.


After a finished run, the test should return one of the following results: PASS or FAIL; in special cases other values are possible, see [https://doc.qt.io/qt-6/qtest.html#QEXPECT_FAIL XFAIL].
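For illustration, a minimal autotest built on the test library could look like the sketch below (the class and function names are made up for this example). QCOMPARE and QVERIFY decide between PASS and FAIL, while QEXPECT_FAIL marks a known problem so that it is reported as XFAIL instead of FAIL:

<syntaxhighlight lang="cpp">
// Minimal, illustrative autotest; the names are hypothetical, not from an actual Qt module.
#include <QtTest/QtTest>

class tst_Example : public QObject
{
    Q_OBJECT

private slots:
    // Reported as PASS when the comparison holds, FAIL otherwise.
    void addition()
    {
        QCOMPARE(1 + 1, 2);
    }

    // A known problem can be marked with QEXPECT_FAIL; the result is then
    // reported as XFAIL (or XPASS if the check unexpectedly succeeds).
    void knownIssue()
    {
        QEXPECT_FAIL("", "Known failure, kept here only as an illustration", Continue);
        QVERIFY(false);
    }
};

QTEST_MAIN(tst_Example)
#include "tst_example.moc"
</syntaxhighlight>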


'''''crashes'''''


A crash happens when the test executable does not provide a result, or when the XML output is corrupted. During test execution different errors/signals can occur, such as a segmentation fault (SIGSEGV), SIGABRT and others; the root cause may lie in the test source code or in other parts of the Qt library.


'''''blacklisting'''''


Additional functionality is blacklisting. Blacklisted tests are run and provide results, but the result is ignored at the CI level. Note that these tests are still executed and thus consume resources; they return results (BPASS and BFAIL) and the outcomes are stored in the database.


Blacklisted tests, however, do not impact the success of an integration. Blacklisting should be treated as a temporary solution - a SHORT TERM quarantining of the test. More information about the blacklist format can be found [https://code.qt.io/cgit/qt/qtbase.git/tree/src/testlib/qtestblacklist.cpp here].
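For illustration, a sketch of what such a BLACKLIST file (a plain-text file kept next to the test's sources) could contain is shown below; the test function names and platform keywords are examples only, and the authoritative description of the format is in qtestblacklist.cpp, linked above:

<pre>
# Hypothetical BLACKLIST file, stored in the test's source directory.
# Each [section] names a test function (optionally a specific data tag);
# the keyword lines below it select the platforms on which the result is ignored.
[someFlakyFunction]
windows
macos

# Ignore only one data row of a data-driven test, and only on one platform.
[anotherFunction:dataTag]
ubuntu-20.04
</pre>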


'''''skipping'''''


Some tests are not supposed to be run on certain platforms. Disabling a test on a particular platform should be achieved by using the [https://doc.qt.io/qt-6/qtest.html#QSKIP QSKIP macro]. A skipped test is not executed and does not return any result. See also: [https://doc.qt.io/qt-5/qttest-best-practices-qdoc.html#select-appropriate-mechanisms-to-exclude-tests Select Appropriate Mechanisms to Exclude Tests].
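A sketch of how QSKIP is typically used for such a platform-specific exclusion is shown below; the class, the test function and the condition are illustrative only:

<syntaxhighlight lang="cpp">
// Illustrative only: a test that excludes itself on Windows with QSKIP.
#include <QtTest/QtTest>

class tst_SkipExample : public QObject
{
    Q_OBJECT

private slots:
    void platformSpecificBehaviour()
    {
#ifdef Q_OS_WIN
        QSKIP("This check exercises a code path that does not exist on Windows.");
#endif
        QVERIFY(true); // the real check would go here
    }
};

QTEST_MAIN(tst_SkipExample)
#include "tst_skipexample.moc"
</syntaxhighlight>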


=== 2) CI LEVEL ===
CI (coin, the Continuous Integration system) automates the builds and manages test runs on different platforms. The platforms include widely known desktop operating systems such as macOS, Windows, or Ubuntu, but they can also be QNX or QEMU.


Regularly, Qt code undergoes testing on approximately 20 distinct operating systems (platforms), and this list changes dynamically. Given Qt's cross-platform nature, tests can be created on one platform and executed on another; in most cases the actual platform is virtualized (though there are exceptions, such as macOS). Automated testing across numerous virtualized platforms facilitates swift integration but introduces new challenges. One such challenge is maintaining system stability and ensuring repeatable results. Tests may fail for various reasons, and the root cause of a failure is not necessarily tied to errors in the source code. Currently we are working on distinguishing between issues originating in the integration process (coin) and those linked to the test source code.


'''''flakiness'''''


Coin introduces a flakiness classification of tests. Each test that fails is repeated up to 5 times (see [https://testresults.qt.io/grafana/d/3dhio4K7k/fastcheck-ci-test-info?orgId=1&viewPanel=72 link]). A test is classified as failed only if it fails 6 times in a row (the first run plus 5 repetitions). Both failed and flaky test runs are stored in the database. We collect the history of such tests: for the dev branch the information can be found on the [https://testresults.qt.io/grafana/d/3dhio4K7k/fastcheck-ci-test-info?orgId=1 "Fast check" dashboard], for all other branches on the [https://testresults.qt.io/grafana/d/000000009/slow-check-all-projects-and-branches-detailed-view-ci-test-info?orgId=1 "Slow check" dashboard].
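The rule described above can be summarised with the small sketch below; it only illustrates the classification logic and is not coin's actual implementation:

<syntaxhighlight lang="cpp">
// Illustrative sketch of the flaky/failed classification rule, not coin's real code.
#include <functional>

enum class Verdict { Passed, Flaky, Failed };

Verdict classify(const std::function<bool()> &runTest)
{
    const int maxAttempts = 6;                      // first run plus up to 5 repetitions
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (runTest())
            return attempt == 1 ? Verdict::Passed   // passed on the first try
                                : Verdict::Flaky;   // failed at least once, then passed
    }
    return Verdict::Failed;                          // failed 6 times in a row
}
</syntaxhighlight>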


'''''integration related crashes'''''


In general, tests that for various reasons do not produce results are classified as crashes. The term is not limited to signals ending a process; crashes can also be caused by timeouts and other events. Some tests are not even run, e.g. because of a missing library. As mentioned earlier, identifying and classifying crashes is an area we currently focus on.


=== 3) TEST RESULTS STATISTICS LEVEL ===
We store integration results in the database for about two years and regularly analyze them.


'''''correlation between flakiness and failures'''''
 
One of the important conclusions is the high correlation between flakiness and failures. In simpler terms, if a test isn't very reliable, it will eventually fail and disrupt the integration process.
 
'''''increasing number of blacklisted tests'''''
 
Another observation is the increasing number of blacklisted tests. Tests are created on purpose: their goal is to make sure the implemented functionality works and is STILL WORKING as designed. By blacklisting we unwittingly create the possibility of introducing new bugs that would otherwise be easily caught early by the test system. In other words, we create holes/vulnerabilities. That's why we should only use the blacklist as a last resort - a way to temporarily "quarantine" the test. Quarantine also helps to identify the root of the flakiness: infrastructure-related failures will most likely start passing again once the environment changes (keep in mind the running environment has its fluctuations); in other cases, if after some time (more than 3 months) the test is still failing, it requires fixing (or replacing) - it should not stay blacklisted.
 
'''''identifying the most impactful tests'''''
 
Dealing with flakiness is an ongoing and sometimes challenging task. We're putting in consistent effort to ensure our tests maintain good quality. Our analysis points out which tests are causing the most trouble in the integration system. The most negatively impactful tests are presented on the [https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info-overview?orgId=1 "Flaky summary" dashboard]. More information on how we got those numbers is available [[Flakiness information|here]].
 
 
 
'''GLOSSARY and links for further reading:'''
 
COIN - [https://testresults.qt.io/coin/doc/ Qt Continuous Integration System]
 
GRAFANA - observability and visualization [https://grafana.com/ platform] used to display [https://testresults.qt.io/grafana/d/000000012/overview?orgId=1 test results and integration metrics]
 
Qt test library [https://doc.qt.io/qt-6/qttest-index.html documentation].
 
[https://code.qt.io/cgit/qt/qtbase.git/tree/src/testlib Qt test library source code]
 
QSKIP macro [https://doc.qt.io/qt-6/qttestlib-tutorial6.html functionality] and [https://doc.qt.io/qt-6/qtest.html#QSKIP documentation].
 
BLACKLIST [https://code.qt.io/cgit/qt/qtbase.git/tree/src/testlib/qtestblacklist.cpp file format]
 
Guidelines to [https://doc.qt.io/qt-6/qttestlib-tutorial1-example.html writing unit tests]

== Flakiness information ==

Make sure to also check the [[Qt test system|overview of the Qt test system]].

At the end of every week we run some analytics and prepare a list of the most failing flaky tests for that week.

Those lists are available on this flaky summary [https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info-overview?orgId=1 dashboard].

If you are having trouble reproducing a flaky autotest failure, there are some suggestions [[How to reproduce autotest fails|here]].

'''Why does flakiness happen?'''

Flakiness is an umbrella term for test and build instabilities. In an ideal scenario, if we run compiled code n times, it should provide exactly the same n results. In the real world we receive different results, and the impact is particularly important when those are test results in integrations.

So far we have identified several sources of flakiness:

1) flakiness rooted in the source code, possibly related to the particular operating system version on which the test is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and the windowing system, timing, or other code issues.

2) flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics the intended hardware well enough, and those shortcomings lead to flakiness related to the running environment - some tests are sensitive, e.g. to slower-than-usual memory.

3) flakiness related to crashes. Test execution should return a result, however in some situations the test ends prematurely: the test executable returns a non-ok exit code, or the output XML file is truncated. In such cases we assume the reason is a crash and attribute it to the last executed function. In some cases more information can be found in the logs - such as signals (SIGKILL, SIGABRT), timeouts, or a stack trace. The root cause of a crash is not necessarily in the test executable - it can also be in the Qt library or in third-party libraries.

4) flakiness related to the CI - the continuous integration system.

'''Why is reducing flakiness important?'''

Flaky tests are the reason for about 30% of failing test integrations; crashes are another group, accounting for roughly 30% of failures. They cause a lot of frustration and deplete trust in tests, because developers are not sure whether tests fail for a reason or because of instabilities. We rerun flaky tests up to 5 times - these reruns slow down integration and consume more power. Failing work items require re-running, which impacts CI capacity. If the maximum CI capacity is reached, it additionally causes timeouts and more failures, creating a negative feedback loop in a complex system. Additionally, the Qt project is growing: every year new automatic tests are added and the code is built on new operating systems. This constant change requires constant customizing/maintaining of the automatic tests to keep flakiness low. In many projects the automatic test source code has twice as many lines as the actual code under test. Flakiness is observed in most software, not just Qt project code.

'''How did we get the flaky and failed numbers?'''

Each integration run on the CI (continuous integration system, coin) is stored in a database. We process the stored information and display it in a tool called Grafana.

Below, a simplified view is presented (scroll down), with the information relevant to developers. Note that the actual process is more complex and is not presented here.

'''What are health-checks?'''

To monitor the stability of the CI, successful qt5 integrations are re-run. The last successful qt5 integration (qt5 dev HEAD with all its submodules at the right revision according to the last submodule update) is re-run at night to check how stable the CI is. These runs, called "HEALTHCHECKS", were demonstrated to contain passing, working code and as such should pass. However, due to CI instabilities they often fail. We collect information about such failures, call it "CI flakiness" (CI instabilities), and add it to the flaky/failed summary statistics.

'''What are insignificant runs?'''

Insignificant runs (work items) are defined for selected platforms. They are "allowed to fail": their failures do not impact (stop) the integration. An insignificant platform run does not cause the Gerrit stage to fail even if the run itself fails.

'''What is a platform?'''

The Qt library is multi-module and cross-platform. Often, but not always, Qt modules are built on what we call the host platform but are intended to run and be tested on what we call the target platform. In most cases a platform is defined by the version of the operating system (e.g. macOS 12 or Ubuntu 20), the processor architecture (e.g. Intel x86 32-bit or x86_64, ARM 64, etc.), and the compiler (e.g. clang, gcc, msvc, etc.). Additionally, extra features can be defined, and a subset of tests can be run.

The diagram below explains the data process visually:

[[File:FLAKY FAILED STATS.png|thumb|diagram of counting flakiness statistics|alt=|center|2381x2381px]]