Flakiness information: Difference between revisions

Revision as of 11:46, 23 October 2024

Make sure to also check the overview of the Qt test system.

At the end of every week, we run some analytics and prepare a list of the most frequently failing flaky tests for that week.

Those lists are available on this flaky summary dashboard.

If you are having trouble reproducing a flaky autotest failure, there are some suggestions here.

Why does flakiness happen?

Flakiness is an umbrella term for test and build instabilities. In an ideal scenario, running the same compiled code n times gives the same result all n times. In the real world, results differ between runs, and the impact is particularly significant when those are test results in integrations.

So far we have identified several sources of flakiness:

Flakiness rooted in source code.

Flakiness related to the particular operating system version on which the test is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and the windowing system.

Flakiness related to virtualization. Most of the runs are executed on virtual machines. Not all virtualization mimics the intended hardware well enough, and those shortcomings lead to flakiness.

Flakiness related to running environment settings. Some tests are sensitive to, for example, slower-than-usual memory.

Why is reducing flakiness important?

Flaky tests are the reason for about 30% of failing integrations. They cause a lot of frustration and erode trust in tests, because developers cannot be sure whether a test failed for a real reason or because of instabilities. We rerun flaky tests up to 5 times; these reruns slow down integration and consume more power. Most running software has some instabilities; the goal is to reduce them as much as is sensible and possible.
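
The rerun policy described above can be sketched roughly as follows. This is only an illustration, not Coin's actual code: the names `classify_test` and `run_once` are hypothetical, and it assumes that "up to 5 times" means up to 5 reruns after the initial failure.

```python
from typing import Callable

MAX_RERUNS = 5  # from the text: flaky tests are rerun up to 5 times


def classify_test(run_once: Callable[[], bool], max_reruns: int = MAX_RERUNS) -> str:
    """Return "passed", "flaky", or "failed" for a single test.

    run_once is a hypothetical callable that executes the test once and
    returns True on success. Assumption: reruns happen only after a failure.
    """
    if run_once():
        return "passed"
    for _ in range(max_reruns):
        if run_once():
            return "flaky"  # failed at first, then passed on a rerun
    return "failed"  # failed the first run and every rerun
```

Under this assumption a test that always fails is executed six times in total, which is where the extra integration time and power consumption come from.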

How do we get the flaky and failed numbers?

The result of each integration run on the CI (continuous integration) system, Coin, is stored in a database. We process the stored information and display it in a tool called Grafana.

A simplified view with the information relevant to developers is presented below (scroll down). Note that the actual process is more complex and is not presented here.

What are health checks?

To monitor the stability of the CI, successful qt5 integrations are re-run. The last successful qt5 integration (qt5 dev HEAD with all its submodules at the right revision according to the last submodule update) is re-run at night to check how stable the CI is. Those runs, called "HEALTHCHECKS", were demonstrated to contain passing, working code and as such should pass. However, due to CI instabilities they often fail. We collect information about such failures, call it "CI flakiness" (CI instabilities), and add it to the flaky failed summary statistics.

What are insignificant runs?

Insignificant runs (work items) are defined for selected platforms. They are "allowed to fail": their failures do not impact (stop) the integration. An insignificant platform run does not cause the Gerrit stage to fail even if the run itself fails.
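
As a minimal sketch of that rule, assuming each work item carries a pass/fail result and an "insignificant" flag (the `WorkItem` type and `integration_passes` function are hypothetical names for illustration, not part of Coin):

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical record of one platform run within an integration."""
    platform: str
    passed: bool
    insignificant: bool = False


def integration_passes(work_items: Iterable[WorkItem]) -> bool:
    # A failed run blocks the integration only if it is significant.
    return all(item.passed or item.insignificant for item in work_items)
```

With this model, a failing run on an insignificant platform is still recorded as a failure but does not change the outcome of the integration.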

What is a platform?

Qt is a multi-module, cross-platform library. Often, but not always, Qt modules are built on what we call the host platform but are intended to run and be tested on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. macOS 12 or Ubuntu 20), the processor architecture (e.g. Intel x86 32-bit or 64-bit, ARM 64, etc.), and the compiler (e.g. clang, gcc, msvc, etc.). Additionally, extra features can be defined, and a subset of tests can be run.

The following diagram explains the data process visually:
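
One way to picture this definition is as a small record type. The sketch below is an assumption made for illustration only: the `Platform` and `BuildConfiguration` names and their fields are hypothetical and do not reflect how Coin actually stores platform data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical model: OS version, architecture, compiler, extra features."""
    os_version: str
    architecture: str
    compiler: str
    features: frozenset = frozenset()


@dataclass(frozen=True)
class BuildConfiguration:
    """Hypothetical pairing of the host (build) and target (run/test) platform."""
    host: Platform
    target: Platform

    def is_cross_build(self) -> bool:
        # Host and target differ when code is built on one platform
        # but intended to run and be tested on another.
        return self.host != self.target
```

For example, `BuildConfiguration(Platform("Ubuntu 20", "x86_64", "gcc"), Platform("Ubuntu 20", "ARM 64", "gcc"))` would describe a build on an x86_64 host for an ARM 64 target; the values are illustrative.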

diagram of counting flakiness statistics

@@ Line 1: / Line 1: @@
+Make sure to also check the [[Qt test system|overview of the Qt test system]].
 At the end of every week, we do some analytics and we prepare a list of the most failing flaky tests for a given week.
-Those lists are available on this flaky summary dashboard.
+Those lists are available on this flaky summary [https://testresults.qt.io/grafana/d/000000007/flaky-summary-ci-test-info-overview?orgId=1 dashboard].
+If you are having trouble reproducing a flaky autotest failure, there are some suggestions [[How to reproduce autotest fails|here]].
+'''Why flakiness happen?'''
-'''Why flakiness happen?
+Flakiness is an umbrella term for tests and build instabilities. In an ideal scenario, if we run compiled code n times, it should provide exactly the same n results.  In the real world, we receive different results, and the impact is particularly important when those are test results in integrations.
-'''
-Flakiness is an umbrella term for tests and build instabilities. In an ideal scenario, if we run compiled code n times, it should provide exactly the same n results.  In the real world, we receive different results, and the impact is particularly important when those are test results in integrations.
 So far we identified several sources of flakiness:
-flakiness rooted in source code
+flakiness rooted in source code flakiness related to a particular operating system version on which it is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and windowing system-flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics well enough intended hardware, those shortcomings lead to flakiness related to running environment settings - some tests are sensitive e.g. to slower-than-usual memory.
-flakiness related to a particular operating system version on which it is run. There can be multiple reasons for it, often related to the abstraction layer between mouse events and windowing system
-flakiness related to virtualization - most of the runs are executed on virtual machines. Not all virtualization mimics well enough intended hardware, those shortcomings lead to flakiness
+'''Why reducing flakiness is important?'''
-flakiness related to running environment settings - some tests are sensitive eg. to slower-than-usual memory.
-Why reducing flakiness is important?
 Flaky tests are the reason for about 30% of failing integrations.  They cause lots of frustration and deplete trust in tests because developers are not sure if tests fail for a reason, or because of instabilities. We rerun flaky tests up to 5 times - this rerun slows down integration and requires more power consumption.  Most running software has some instabilities, the goal is to reduce it as much as is sensible and possible.
 '''How did we get flaky and failed numbers?'''
@@ Line 23: / Line 27: @@
 Below simplified view is presented (scroll down), with information relevant to developers. Note, that the actual process is more complex and is not presented here.
-'''What are heathchecks integrations?
-'''
-To monitor the stability of CI successful qt5 integrations are re-rununed.  The last successful qt5 integration together with its submodules in corresponding versions  - is re-run at night to check how stable is ci. Those integrations called "HEALTCHECKS" demonstrated to have to pass, working code - however due to CI instabilities they sometimes fail when re-run. We collect this information and label it as "CI flakiness" (CI instabilities) and we add it to flaky failed statistics
+'''What are health-checks ? '''
+To monitor the stability of CI successful qt5 integrations are re-run.  The last successful qt5 integration (qt5 dev HEAD with all its submodules at the right revision according to the last submodule update).  - is re-run at night to check how stable is CI. Those runs called "HEALTHCHECKS" were demonstrated to contain passing, working code and as such should pass. However, due to CI instabilities they often fail. We collect information about such fails and call it "CI flakiness" (CI instabilities) and we add it to flaky failed summary statistics
 '''What are insignificant runs?'''
-Insignificant runs (work items) are defined for selected platforms.  They are "allowed to fall", their falls do not impact (stop) integration. An insignificant platform run does not cause Gerrit stage to fail even if it fails.
+Insignificant runs (work items) are defined for selected platforms.  They are "allowed to fail", their fails do not impact (stop) integration. An insignificant platform run does not cause Gerrit stage to fail even if it fails.
-'''What is a platform?'''
-Qt library is multi-module and cross-platform. Often, but not always qt modules are built on what we call host platform, but intended to run and test on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. Mac OS 12, or Ubuntu 20), processor architecture (eg. Intel x86_32 bits or x86_64 bits,  ARM 64, etc ), and compiler (eg. clang, gcc, mvsc etc) . Additionally, extra features can be defined, and a subset of tests can be run.
-If you are interested in even more details - the scripts that do the analytics are available under this link
+'''What is a platform?'''
+Qt library is multi-module and cross-platform. Often, but not always qt modules are built on what we call host platform, but intended to run and test on what we call the target platform. In most cases, the platform is defined by the version of the operating system (e.g. Mac OS 12, or Ubuntu 20), processor architecture (e.g. Intel x86_32 bits or x86_64 bits,  ARM 64, etc. ), and compiler (e.g. clang, gcc, msvc etc.) . Additionally, extra features can be defined, and a subset of tests can be run.
-The diagram explaining visually the data process:
+The diagram explaining visually the data process:<br />
+[[File:FLAKY FAILED STATS.png|thumb|diagram of counting flakiness statistics|alt=|center|2381x2381px]]
+<br />
