Katrina the Tester: A stability strategy for test automation

Tuesday 30 January 2018

As part of the continuous integration strategy for one of our products, we run stability builds each night. The purpose is to detect changes in the product or the tests that cause intermittent issues, which can be obscured during the day. Stability builds give us test results against a consistent code base at a time when our test environments are not under heavy load.

The stability builds execute a suite of web-based user interface automation against mocked back-end test data. They run to a schedule and, on a good night, we see six successful builds.

The builds do not run sequentially. At 1am and 4am we trigger two builds in close succession. These execute in parallel so that we use more of our Selenium Grid, which can give early warning of problems caused by load and thread contention.

When things are not going well, we rarely see six failed builds. As problems emerge, the stability test result trend starts to show a mix of successful and failed builds.

In a suite of over 250 tests, there might be a handful of failures. The number of failing tests, and the specific tests that fail, will often vary between builds. Sometimes there is an obvious pattern, e.g. the failing tests all use an image picker dialog. Sometimes there appears to be no common link.
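One way to make the variation between builds easier to read is to compare the failures from each build in a night. The sketch below is illustrative only and is not the tooling we use: the function name, the build labels, and the test names are all hypothetical, and it assumes the failed test names for each build have already been collected from the CI server.

```python
from collections import Counter


def classify_failures(builds):
    """Split failing tests into consistent and intermittent groups.

    `builds` maps a build label to the set of test names that failed
    in that build. A test that failed in every build is "consistent";
    one that failed in some builds but not all is "intermittent".
    """
    total = len(builds)
    counts = Counter(test for failed in builds.values() for test in failed)
    consistent = sorted(t for t, n in counts.items() if n == total)
    intermittent = sorted(t for t, n in counts.items() if n < total)
    return consistent, intermittent


# Hypothetical failures from one night of six stability builds.
nightly = {
    "build-1": {"test_image_picker_upload"},
    "build-2": set(),
    "build-3": {"test_image_picker_upload", "test_save_draft"},
    "build-4": set(),
    "build-5": {"test_save_draft"},
    "build-6": set(),
}

consistent, intermittent = classify_failures(nightly)
print("consistent:", consistent)
print("intermittent:", intermittent)
```

With this made-up data no test fails in every build, so both failing tests are reported as intermittent, which is the kind of result that would prompt a closer look the next morning.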

Why don't we catch these problems during the day?

These tests are part of a build pipeline that includes a large unit test suite. In the build that is run during the day, the test result trend is skewed by unit test failures. The developers are actively working on the code and using our continuous integration for fast feedback.

Once the unit tests are successful, intermittent issues in the user interface tests are often resolved in a subsequent build without code changes. This means that the development team are not blocked: once the build executes successfully, they can merge their code.

The overnight stability build is a collective conscience for everyone who works on the product. When the build status changes state, a notification is pushed into the shared chat channel.
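The rule of notifying only when the status changes, rather than on every build, can be sketched in a few lines. This is a minimal illustration under assumptions: the function name, the status strings, and the build labels are hypothetical, and the chat integration itself is left out because it depends on the CI server and chat tool in use.

```python
def status_changes(history):
    """Yield (build, old_status, new_status) whenever the status flips.

    `history` is an ordered iterable of (build_label, status) pairs.
    The first build never produces a notification because there is
    nothing to compare it with.
    """
    previous = None
    for build, status in history:
        if previous is not None and status != previous:
            yield build, previous, status
        previous = status


# Hypothetical sequence of stability build results.
history = [
    ("night-1", "passed"),
    ("night-2", "passed"),
    ("night-3", "failed"),
    ("night-4", "failed"),
    ("night-5", "passed"),
]

for build, old, new in status_changes(history):
    print(f"{build}: {old} -> {new}")
```

Reporting only transitions keeps the channel quiet while things are healthy, so a message stands out when it does arrive.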

Each morning someone will look at the failed builds, then share a short comment about their investigation in a thread of conversation spawned from the original notification message. The team decide whether additional investigation is warranted and how the problem might be addressed.

It can be difficult to prioritise technical debt tasks in test automation. The stability build makes problems visible quickly, to a wide audience. It is rare that these failures are systematically neglected. We know from experience that ignoring the problems has a negative impact on the cycle time of our development teams. When it becomes part of the culture to repeatedly trigger a build in order to get a clean set of test results, everything slows down and people become frustrated.

If your user interface tests are integrated into your pipeline, you may find value in adopting a similar approach to stability. We see benefits in early detection, raising awareness of automation across a broad audience, and creating shared ownership of issue resolution.

4 comments:

Anuj Sharma, 31 January 2018 at 12:05
"These tests are part of a build pipeline that includes a large unit test suite. In the build that is run during the day, the test result trend is skewed by unit test failures. The developers are actively working on the code and using our continuous integration for fast feedback."

So this assumes that the developers have fixed the defects found during unit testing on that day?

Unknown, 1 February 2018 at 01:26
Hello Katrina, thanks for sharing your strategy to stabilize the test automation - a huge topic in many projects. I have two questions on the mocked back-end test data:

- Are the same tests also run against a fully integrated system, and do you double-check when a test fails against the mocked data but passes against the real back-end? This could be a hint that the data should be updated. Or the situation could happen vice versa ... so you have a false positive test.

- Is the mocked back-end implemented on your own, or are you using a framework or commercial software?

Best, Sven