Flakiness information: Difference between revisions

Latest revision as of 11:42, 18 June 2026

Before reading this page, make sure you are familiar with the overview of the Qt test system and the COIN glossary.

This page explains the algorithm used to process statistics about the top flaky tests.

At the end of every week, we run analytics and generate a list of the most frequently failing unreliable tests for that week. These weekly lists are published every Monday and are available on the Flaky Summary Weekly Top Flaky Tests dashboard.

Why does flakiness happen?

Flakiness is an umbrella term for test and build instabilities. In an ideal scenario, if we run compiled code n times, it should produce exactly the same result all n times. In the real world, we receive different results, and the impact is particularly significant when those are test results in integrations.

So far we have identified several sources of flakiness:

1) flakiness rooted in source code, possibly related to the particular operating system version on which the code is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and the windowing system, timing, or other code issues.

2) flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics the intended hardware well enough, and those shortcomings lead to flakiness related to the settings of the running environment - some tests are sensitive, e.g., to slower-than-usual memory.

3) flakiness related to crashes. Test execution should return a result; however, in some situations a test ends prematurely: the test executable returns a non-OK exit code and the output XML file is truncated. In such cases we assume the reason is a crash and attribute it to the last executed function. In some cases more information can be found in the logs, such as signals (SIGKILL, SIGABRT), timeouts, or a call stack. The root cause of a crash is not necessarily the test executable; it can also be the Qt library or a 3rd party library.

4) flakiness related to the CI (continuous integration) system.
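
The crash heuristic described in item 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual analytics code: the function name `crashed_function` is hypothetical, and the sketch assumes the test report wraps each test function in a `<TestFunction name="...">` element.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Assumption: each test function is reported as <TestFunction name="...">.
FUNCTION_TAG = re.compile(r'<TestFunction\s+name="([^"]+)"')


def crashed_function(return_code: int, xml_text: str) -> Optional[str]:
    """Return the function blamed for a crash, or None if no crash is assumed.

    A crash is assumed when the exit code is non-zero AND the XML report
    cannot be parsed (for example because it was cut off mid-write).
    """
    if return_code == 0:
        return None
    try:
        ET.fromstring(xml_text)
        return None  # complete report: an ordinary failure, not a crash
    except ET.ParseError:
        pass
    # Blame the last test function that started before the report ended.
    names = FUNCTION_TAG.findall(xml_text)
    return names[-1] if names else "<unknown>"
```

For a report that stops right after `<TestFunction name="testResize">` and an exit code of 1, the sketch blames `testResize`.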

Why is reducing flakiness important?

Flaky tests are the reason for about 30% of failing test integrations; crashes account for roughly another 30% of failures. They cause a lot of frustration and deplete trust in tests, because developers are not sure whether tests fail for a reason or because of instabilities. We rerun flaky tests up to 5 times; these reruns slow down integration and consume more power. Failing work items require re-running, which reduces the available CI capacity. If the maximum CI capacity is reached, this additionally causes timeouts and more failures, creating a self-reinforcing feedback loop in an already complex system. Additionally, the Qt project is growing: every year new automatic tests are added and the code is built on new operating systems. This constant change requires constant customization and maintenance of the automatic tests to keep flakiness low. In many projects, the source code of the automatic tests has twice as many lines as the code under test. Flakiness is observed in most software, not just in Qt project code.

How are statistics about top unreliable tests processed?

The results of each integration/task executed in COIN (Qt's in-house continuous integration system) are stored in a database. We process this data and visualize it using a tool called Grafana.

The top unreliable test list contains only failures that have the greatest negative impact on code integrations in the dev branch. Therefore, failures occurring in Early Warnings, in Status Checks, and on insignificant platforms are excluded from the analysis, because they do not directly impact intended code integrations.

Health Check runs, however, are included because they expose failures in code that has already been integrated. Capturing and fixing those failures helps improve the overall stability of the test codebase.
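
The inclusion rules above can be expressed as a simple filter. The following is a minimal sketch, not the real analytics script: the `Failure` record, its field names, and the run-type strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for run types that are excluded from the analysis.
EXCLUDED_RUN_TYPES = {"early_warning", "status_check"}


@dataclass(frozen=True)
class Failure:
    test: str
    branch: str
    run_type: str        # e.g. "integration", "health_check", "early_warning"
    insignificant: bool  # True if the platform is marked insignificant


def counts_for_top_list(failure: Failure) -> bool:
    """Keep only dev-branch failures from integrations and Health Checks
    on significant platforms."""
    return (
        failure.branch == "dev"
        and failure.run_type not in EXCLUDED_RUN_TYPES
        and not failure.insignificant
    )
```

Under these assumptions, a Health Check failure on a significant platform in dev is kept, while the same failure in an Early Warning run is dropped.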

Tests that are only flaky and eventually pass within the maximum of five reruns are excluded from the analysis because they do not have a significant impact on integrations.
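
To illustrate the difference between a test that is only flaky and one that fails, here is a hedged sketch of the rerun logic. It assumes that "five reruns" means up to five additional attempts after the first failed run; the names `Outcome` and `run_with_reruns` are hypothetical.

```python
from enum import Enum
from typing import Callable

MAX_RERUNS = 5  # assumption: five additional attempts after the first run


class Outcome(Enum):
    PASSED = "passed"  # passed on the first attempt
    FLAKY = "flaky"    # failed first, then passed within the reruns
    FAILED = "failed"  # failed the first attempt and every rerun


def run_with_reruns(run_once: Callable[[], bool],
                    max_reruns: int = MAX_RERUNS) -> Outcome:
    """Run a test, rerunning it after a failure up to max_reruns times."""
    if run_once():
        return Outcome.PASSED
    for _ in range(max_reruns):
        if run_once():
            return Outcome.FLAKY
    return Outcome.FAILED
```

In this sketch only `Outcome.FAILED` would count towards the weekly list; `Outcome.FLAKY` alone would not.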

The top unreliable tests are thus crashes, plus those flaky tests that also failed in CI during the given week.

At the bottom of this page (scroll down), you can find a visual diagram that illustrates the information most relevant to developers. Note that the actual process is more complex than shown in the diagram and is not fully represented there.

Once the tests are gathered into a list, they are ranked by the total number of crashes and failures recorded during the given week. The ranking score is the sum of these crashes and failures.

For the top five tests with the highest scores, JIRA tickets are automatically created. The dashboard used to track these tickets is available here.
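
The ranking step can be sketched as below. This is an illustrative example only: the event tuples and the `rank_tests` helper are hypothetical, and the real data comes from the COIN database rather than from an in-memory list.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_tests(events: Iterable[Tuple[str, str]],
               top_n: int = 5) -> List[Tuple[str, int]]:
    """Rank tests by score = number of crashes + number of failures.

    events holds (test_name, kind) pairs for one week; only the kinds
    "crash" and "failure" contribute to the score.
    """
    scores: Counter = Counter()
    for test_name, kind in events:
        if kind in ("crash", "failure"):
            scores[test_name] += 1
    # The top_n highest-scoring tests are the candidates for tickets.
    return scores.most_common(top_n)


week = [
    ("tst_a", "crash"), ("tst_b", "failure"), ("tst_a", "failure"),
    ("tst_c", "flaky"), ("tst_a", "failure"), ("tst_b", "crash"),
]
print(rank_tests(week))  # [('tst_a', 3), ('tst_b', 2)]
```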

What are the criteria for closing a ticket related to top flaky tests?

An issue can be considered resolved when the automated test has not failed or crashed in integrations or Health Check runs on the dev branch for three consecutive months prior to the ticket's closure.
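
As a rough sketch of this criterion, the check below approximates "three consecutive months" as 90 days, which is an assumption made here for simplicity. The helper name `can_close` is hypothetical, and the input dates are expected to cover failures and crashes from dev-branch integrations and Health Check runs only.

```python
from datetime import date, timedelta
from typing import Iterable

# Assumption: "three months" is approximated as 90 days.
QUIET_PERIOD = timedelta(days=90)


def can_close(failure_dates: Iterable[date], closing_day: date) -> bool:
    """True if no failure or crash falls in the quiet period before closing."""
    window_start = closing_day - QUIET_PERIOD
    return not any(window_start <= day <= closing_day for day in failure_dates)
```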

A code change is recommended but not mandatory. In some cases, test flakiness is caused by the CI infrastructure itself rather than by the test or source code, and therefore cannot be fixed through code changes.

If you are having trouble reproducing a flaky autotest failure, there are some suggestions here.

What are Health Checks?

To monitor the stability of the CI, successful qt5 integrations are re-run. The last successful qt5 integration (qt5 dev HEAD with all its submodules at the right revision according to the last submodule update) is re-run at night to check how stable the CI is. Those runs, called "HEALTHCHECKS", have been demonstrated to contain passing, working code and as such should pass. However, due to CI instabilities they often fail. We collect information about such failures, call it "CI flakiness" (CI instabilities), and add it to the flaky failed summary statistics.

What are insignificant runs?

Insignificant runs (work items) are defined for selected platforms. They are "allowed to fail": their failures do not impact (stop) the integration. A failing run on an insignificant platform does not cause the Gerrit stage to fail.
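
The effect of the insignificant flag on an integration's result can be sketched as follows. The `WorkItem` record, its fields, and the platform names in the usage note are hypothetical; the sketch only shows the rule that an insignificant failure is ignored.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkItem:
    platform: str
    passed: bool
    insignificant: bool = False  # "allowed to fail"


def integration_passes(items: Iterable[WorkItem]) -> bool:
    """An integration passes if every significant work item passed."""
    return all(item.passed or item.insignificant for item in items)
```

For example, a failing work item on a platform flagged as insignificant leaves the overall result unchanged, while the same failure on a significant platform fails the integration.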

What is a platform?

The Qt library is multi-module and cross-platform. Often, but not always, Qt modules are built on what we call the host platform but are intended to run and be tested on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. macOS 12 or Ubuntu 20), the processor architecture (e.g. Intel x86_32 or x86_64, ARM 64, etc.), and the compiler (e.g. clang, gcc, msvc, etc.). Additionally, extra features can be defined, and a subset of tests can be run.

The following diagram visualizes how the data is processed:

@@ Line 1: / Line 1: @@
-At the end of every week, we do some analytics and we prepare a list of the most failing flaky tests for a given week.
+Before reading this page, make sure you are familiar with [[Qt test system|overview of the Qt test system]] and [[Coin glossary for Grafana users|COIN glossary]].
-Those lists are available on this flaky summary dashboard.
-'''Why flakiness happen?'''
+This page explains algorithm used  in processing stastistics about top flaky tests.
+At the end of every week, we do some analytics and we generate a list of the most failing unreliable tests for a given week. Those weekly lists are published every Monday and are available on  [https://testresults.qt.io/grafana/d/000000007/7d43a2c?orgId=1&from=now-7d&to=now&timezone=utc&var-project=$__all Flaky Summary Weekly Top Flaky Tests dasbhoard].
+==Why flakiness happen?==
 Flakiness is an umbrella term for tests and build instabilities. In an ideal scenario, if we run compiled code n times, it should provide exactly the same n results.  In the real world, we receive different results, and the impact is particularly important when those are test results in integrations.
 So far we identified several sources of flakiness:
-flakiness rooted in source code
+) flakiness rooted in source code flakiness possibly related a particular operating system version on which it is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and windowing system, timing or  other code issues.
-flakiness related to a particular operating system version on which it is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and windowing system
-flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics well enough intended hardware, those shortcomings lead to flakiness
+) flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics well enough intended hardware, those shortcomings lead to flakiness related to running environment settings - some tests are sensitive e.g. to slower-than-usual memory.
-flakiness related to running environment settings - some tests are sensitive eg. to slower-than-usual memory.
-Why reducing flakiness is important?
+) flakiness related to crashes.  Test execution should return a result, however in some situation test ends prematurely. The test executable provides non-ok return codes, the output xml file is truncated. In such cases we assume the reason is a crash and we attribute it to the last executed function. In some cases more information can be found in logs - such as signals (SIGKILL, SIGBART), timeouts, or a stack call.  The root of crash is not necessary the test executable - but also qt library or 3rd party libraries
+) flakiness related to CI - continuous integration system
+==Why reducing flakiness is important?==
+Flaky tests are the reason for about 30% of failing test integrations, another group are crashes roughly 30% reasons of failure.  They cause lots of frustration and deplete trust in tests because developers are not sure if tests fail for a reason, or because of instabilities. We rerun flaky tests up to 5 times - this rerun slows down integration and requires more power consumption. Failing workitems require re-running and that makes impact on the CI capacity. If the max CI capacity is reached it additionally causes timeouts and more failures - making it negative feedback and complex system   Additionally qt project is growing and every year new automatic test are added and the code is build on new operating systems.  This constant change requires constant customizing/maintaining of the automatic test to keep the flakiness low. In many project automatic tests source code has twice as many lines as actual code under test.  Flakiness is observed in most software, not just Qt project code.
-Flaky tests are the reason for about 30% of failing integrations.  They cause lots of frustration and deplete trust in tests because developers are not sure if tests fail for a reason, or because of instabilities. We rerun flaky tests up to 5 times - this rerun slows down integration and requires more power consumption.  Most running software has some instabilities, the goal is to reduce it as much as is sensible and possible.
+==How are statistics about top unreliable tests processed?==
+Each integration/task results executed in COIN (Qt's in-house Continuous Integration system, COIN) are stored in a database. We process this data and visualize it using a tool called [https://testresults.qt.io/grafana/d/000000012/overview?orgId=1&from=now-6h&to=now&timezone=browser Grafana].
-'''How did we get flaky and failed numbers?'''
+Top unreliable test list contains only failures that have the greatest negative impact on code integrations in the dev branch.  Therefore, failures occuring  in Early Warnings, Status Checks and on insignificant platforms are exclded from the  analysis, because they do not  directly impact intended code integrations.
-Each integration result run on CI (continuous integration system, coin) is stored in a database. We process stored information and display it in a tool called Grafana.
+HealthChecks runs, however,  are included because they expose failures in code that has already been integrated. Capturing and fixing those failures  helps improve the overall stability of the test codebase.
-Below simplified view is presented (scroll down), with information relevant to developers. Note, that the actual process is more complex and is not presented here.
+Tests that are only flaky and eventually pass within the maximum of five reruns are excluded from the analysis because they do not have a significant impact on integrations.
-'''What are heathchecks integrations? '''
+Top unreliable tests are thus: crashes, and those flaky tests that also failed in CI during given week.
-To monitor the stability of CI successful qt5 integrations are re-rununed.  The last successful qt5 integration together with its submodules in corresponding versions  - is re-run at night to check how stable is ci. Those integrations called "HEALTCHECKS" demonstrated to have to pass, working code - however due to CI instabilities they sometimes fail when re-run. We collect this information and label it as "CI flakiness" (CI instabilities) and we add it to flaky failed statistics
+At the bottom of this page (scroll down), you can find a visual diagram that illustrates the information most relevant to developers. Note that the actual process is more complex than shown in the diagram and is not fully represented there.
-'''What are insignificant runs?'''
+Once tests are gathered to list, they are ranked based on the total number of crashes and failures recorded during a given week. The ranking score is calculated as the sum of these crashes and failures.
-Insignificant runs (work items) are defined for selected platforms.  They are "allowed to fall", their falls do not impact (stop) integration. An insignificant platform run does not cause Gerrit stage to fail even if it fails.
+For the top five tests with the highest scores, JIRA tickets are automatically created. The dashboard used to track these tickets is available [https://qt-project.atlassian.net/jira/dashboards/10051 here].
+==What are criteria for closing ticket related to top flaky tests?==
+An issue can be considered resolved when the automated test has not failed or crashed in integrations or Health Check runs on the dev branch for three consecutive months prior the ticket's closure.
-'''What is a platform?'''
+A code change is recommended but not mandatory. In some cases, test flakiness is caused by the CI infrastructure itself rather than by the test or source code, and therefore cannot be fixed through code changes.
-Qt library is multi-module and cross-platform. Often, but not always qt modules are built on what we call host platform, but intended to run and test on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. Mac OS 12, or Ubuntu 20), processor architecture (eg. Intel x86_32 bits or x86_64 bits,  ARM 64, etc ), and compiler (eg. clang, gcc, mvsc etc) . Additionally, extra features can be defined, and a subset of tests can be run.
+If you are having trouble reproducing a flaky autotest failure, there are some suggestions [[How to reproduce autotest fails|here]].
-If you are interested in even more details - the scripts that do the analytics are available under this link
+== What are Health Checks? ==
+To monitor the stability of CI successful qt5 integrations are re-run.  The last successful qt5 integration (qt5 dev HEAD with all its submodules at the right revision according to the last submodule update).  - is re-run at night to check how stable is CI. Those runs called "HEALTHCHECKS" were demonstrated to contain passing, working code and as such should pass. However, due to CI instabilities they often fail. We collect information about such fails and call it "CI flakiness" (CI instabilities) and we add it to flaky failed summary statistics
+== What are insignificant runs? ==
+Insignificant runs (work items) are defined for selected platforms.  They are "allowed to fail", their fails do not impact (stop) integration. An insignificant platform run does not cause Gerrit stage to fail even if it fails.
+== What is a platform?==
+Qt library is multi-module and cross-platform. Often, but not always qt modules are built on what we call host platform, but intended to run and test on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. Mac OS 12, or Ubuntu 20), processor architecture (e.g. Intel x86_32 bits or x86_64 bits,  ARM 64, etc. ), and compiler (e.g. clang, gcc, msvc etc.) . Additionally, extra features can be defined, and a subset of tests can be run.
-The diagram explaining visually the data process:
+The diagram explaining visually the data process:<br />
-<br />
+[[File:FLAKY FAILED STATS.png|thumb|diagram of counting flakiness statistics|alt=|center|2381x2381px]]
-[[File:FLAKY FAILED STATS.png|thumb|774x774px|diagram of counting flakiness statistics]]
 <br />