Every test suite has them: tests that pass on Monday, fail on Tuesday, and pass again on Wednesday without anyone touching the code. The standard response is to label them 'flaky' and move on. But what if flakiness is not just noise—what if it's a signal about the design of your system and your tests? At gleamr.xyz, we think flakiness deserves a closer look, not as a nuisance to be silenced, but as a diagnostic tool that reveals deeper issues in test strategy and system architecture.
This guide is for teams that have tried retries and quarantine lists but still see intermittent failures. We'll show you how to decode flaky test patterns, triage them effectively, and use them to improve both your test suite and your design. By the end, you'll have a framework for turning flaky tests from a source of frustration into a source of insight.
Why Flaky Tests Matter More Than You Think
Flaky tests erode trust in the test suite. When a test fails, the first question becomes: "Is it real or flaky?" That uncertainty slows down development, reduces confidence in deployments, and wastes hours of debugging time. But the real cost is not just the time spent rerunning tests—it's the design blind spots that flakiness reveals.
Consider a typical scenario: a test that checks the order of items in a list after an asynchronous update. It passes locally but fails intermittently in CI. The team adds a sleep, then a retry, then marks it flaky. But the root cause—a race condition between the update and the read—is never addressed. That race condition may also affect production behavior under load, but because the test is flaky, the signal is ignored.
We've seen teams spend months accumulating a "flaky list" of dozens of tests, each one a small leak in the reliability of the suite. The cumulative effect is a test suite that no one trusts, where green builds are celebrated but not believed. This is not just a testing problem—it's a design problem. Flaky tests often point to non-deterministic behavior in the system: shared mutable state, unmanaged concurrency, or dependencies that are not properly isolated. By treating flakiness as a design signal, teams can address the underlying issues rather than just the symptoms.
There's also a human cost. Developers who see frequent flaky failures become conditioned to ignore test results. They stop investigating failures, assuming they'll pass on retry. This erodes the feedback loop that tests are meant to provide. Over time, the test suite becomes a liability rather than an asset. Recognizing flakiness as a design signal helps restore that feedback loop by focusing attention on the root causes.
The Hidden Cost of Retries
Retry mechanisms are a common band-aid for flaky tests. But retries hide the problem without fixing it. A test that requires retries is not reliable; it's just masked. Moreover, retries increase CI time and resource usage, and they can mask real failures that happen to coincide with flaky ones. A better approach is to treat each flaky test as a bug report about the system's design.
Trust as a Testing Metric
We often measure test coverage, execution time, and pass rate. Trust is harder to quantify. One proxy is the flaky rate: the share of tests in the suite that fail intermittently without any code change. A suite with zero flaky tests is more trustworthy than one with a 1% flaky rate, even if the latter has higher coverage. Teams should track the flaky rate as a key quality metric and aim to eliminate flaky tests, not just tolerate them.
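One way to put a number on this is sketched below. It assumes CI run history is available as records of test name, commit, and outcome, and it uses a working definition that is an assumption rather than a standard: a test counts as flaky if it both passed and failed on the same commit. The `Run` record and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Run(NamedTuple):
    """One CI execution of one test (hypothetical record shape)."""
    test: str      # test identifier
    commit: str    # commit the run executed against
    passed: bool


def flaky_tests(runs: Iterable[Run]) -> Set[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)
    return {test for (test, _commit), seen in outcomes.items() if len(seen) == 2}


def flaky_rate(runs: Iterable[Run]) -> float:
    """Share of distinct tests in the history that are flaky."""
    runs = list(runs)
    all_tests = {run.test for run in runs}
    return len(flaky_tests(runs)) / len(all_tests) if all_tests else 0.0
```

A test that fails on one commit and passes on another is not counted here, because the code changed between the runs and the failure may be real.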
Core Idea: Flakiness as a Design Signal
The central idea of this guide is that flaky tests are not random failures—they are symptoms of non-determinism in the system under test. Non-determinism arises from design choices: shared state between tests, reliance on wall-clock time, external services that behave differently on each call, or data that is not properly seeded. When a test fails intermittently, it's telling you that one of these design elements is not under control.
Think of a flaky test as a canary in the coal mine. It's not the problem itself; it's an early warning of a deeper issue. For example, a test that occasionally fails because of a race condition in an async workflow is pointing to a design that does not handle concurrency correctly. That same race condition could cause data corruption in production under the right conditions. The flaky test is giving you a chance to fix it before it becomes a production incident.
This perspective shifts the conversation from "how do we make this test pass?" to "what is this test telling us about our design?" It changes the triage process from a quick fix (add a retry) to a diagnostic investigation (find and eliminate the source of non-determinism). Over time, this approach leads to a more deterministic test suite and a more robust system.
Types of Flakiness and Their Design Roots
Not all flaky tests are the same. We can categorize them by root cause:
- Race conditions: Two or more operations execute in an unpredictable order. Common in async code, event-driven systems, and multi-threaded applications. Design fix: use synchronization primitives or redesign to avoid shared state.
- Shared mutable state: Tests that modify global data (databases, files, singletons) without proper isolation. Design fix: ensure each test has its own data context, or use test fixtures that reset state.
- Time-dependent behavior: Tests that rely on specific timing (e.g., waiting for a fixed duration). Design fix: use virtual clocks or inject time as a dependency.
- External dependency variability: Tests that call external APIs or services that may return different results. Design fix: mock or stub external dependencies, or use contract testing to verify behavior.
- Environment coupling: Tests that depend on specific environment settings (e.g., locale, timezone, network conditions). Design fix: make tests environment-agnostic by setting explicit conditions in the test setup.
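As an illustration of the time-dependent category, here is a minimal sketch of injecting time as a dependency. The `Session` and `FakeClock` classes are hypothetical; the point is only that the test controls the clock instead of sleeping.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class Session:
    """Hypothetical object whose behavior depends on the current time."""

    def __init__(self, created_at: datetime, ttl: timedelta,
                 now: Callable[[], datetime]) -> None:
        self.created_at = created_at
        self.ttl = ttl
        self._now = now  # injected clock; production code would pass a real one

    def is_expired(self) -> bool:
        return self._now() - self.created_at >= self.ttl


class FakeClock:
    """Test double: time only moves when the test advances it."""

    def __init__(self, start: datetime) -> None:
        self.current = start

    def __call__(self) -> datetime:
        return self.current

    def advance(self, delta: timedelta) -> None:
        self.current += delta


def test_session_expires_after_ttl() -> None:
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    clock = FakeClock(start)
    session = Session(created_at=start, ttl=timedelta(minutes=30), now=clock)

    assert not session.is_expired()
    clock.advance(timedelta(minutes=29, seconds=59))
    assert not session.is_expired()   # one second before the boundary
    clock.advance(timedelta(seconds=1))
    assert session.is_expired()       # exactly at the boundary
```

Because no real time passes, the test runs instantly and behaves the same on a fast laptop and a loaded CI runner.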
The Flaky Test Triage Framework
When a flaky test appears, follow these steps:
- Reproduce: Try to reproduce the failure locally. If you can't, add logging to capture the state at the time of failure.
- Isolate: Determine whether the flakiness comes from the test itself, from interference by other tests, or from the system under test. Run the test in isolation to see if it still fails.
- Diagnose: Identify the root cause using the categories above. Look for non-deterministic elements in the test and the system.
- Decide: Choose one of three actions—fix the root cause, quarantine the test, or rewrite the test. The decision depends on the severity of the design issue and the effort required.
- Monitor: Track the test over time to ensure the fix is effective. If the test becomes flaky again, revisit the diagnosis.
How It Works Under the Hood
To understand why flaky tests occur, we need to look at the mechanics of test execution. A test typically follows a setup-exercise-verify-teardown cycle. Flakiness arises when any of these steps depends on something outside the test's control. Let's examine the common mechanisms.
Shared state is the most frequent culprit. In many test suites, tests share a database, file system, or in-memory cache. If one test modifies data that another test reads, the order of execution determines the outcome. Parallel test execution amplifies this problem because tests run concurrently and interact in unpredictable ways. The solution is to ensure each test has its own isolated data context—either by using transactions that roll back, or by creating unique data per test.
Asynchronous operations introduce timing uncertainty. A test that triggers an async operation and then immediately checks for results may see the old state if the operation hasn't completed. The common fix is to wait for a specific condition (e.g., a callback, a change in state) rather than waiting for a fixed amount of time. But even conditional waits can be flaky if the condition is not deterministic—for example, waiting for an event that may never arrive due to a bug.
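A minimal sketch of waiting on a condition instead of a fixed delay, using `threading.Event` from the standard library. The `AsyncWorker` class is a hypothetical stand-in for whatever performs the async operation.

```python
import threading
from typing import Optional


class AsyncWorker:
    """Hypothetical component that finishes its work on a background thread."""

    def __init__(self) -> None:
        self.result: Optional[int] = None
        self.done = threading.Event()

    def start(self, value: int) -> None:
        def work() -> None:
            self.result = value * 2
            self.done.set()  # signal completion only after the result is stored

        threading.Thread(target=work, daemon=True).start()


def test_worker_doubles_value() -> None:
    worker = AsyncWorker()
    worker.start(21)
    # wait() returns as soon as the event is set, or False after the timeout.
    assert worker.done.wait(timeout=5.0), "worker did not finish within 5 seconds"
    assert worker.result == 42
```

The timeout is an upper bound, not a delay: the test proceeds the moment the work completes, and a timeout failure points to a missing event rather than a wrong result.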
External services are another source of non-determinism. A test that calls a real API may get different responses based on network latency, server load, or data changes. The standard approach is to mock the external service, but mocks introduce their own problems: a mock that drifts from the real service's behavior lets tests pass even though the real integration would fail. A better approach is to back mocks with contract testing, or to use a test double that simulates realistic behavior.
Randomness and non-deterministic algorithms can also cause flakiness. If the system under test uses random numbers, the test may pass or fail depending on the random seed. The fix is to seed the random number generator in tests, or to use a deterministic alternative.
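A minimal sketch of the seeding fix, assuming the code under test can accept a `random.Random` instance instead of using the module-level generator. `pick_sample` is a hypothetical function.

```python
import random
from typing import List, Sequence


def pick_sample(items: Sequence[int], k: int, rng: random.Random) -> List[int]:
    """Hypothetical function under test; randomness comes from the injected rng."""
    return rng.sample(items, k)


def test_sample_is_reproducible_with_fixed_seed() -> None:
    items = list(range(100))
    first = pick_sample(items, 5, random.Random(1234))
    second = pick_sample(items, 5, random.Random(1234))
    assert first == second          # same seed, same sequence
    assert len(set(first)) == 5     # sample() never repeats an element
```

Passing the generator in keeps tests from interfering with each other through the shared global random state.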
The Role of Test Design
Test design itself can introduce flakiness. Tests that are tightly coupled to implementation details (e.g., checking internal state that changes with refactoring) break for reasons unrelated to behavior. Tests that rely on exact output ordering (e.g., comparing lists without sorting) fail intermittently whenever the system does not guarantee that order. Good test design focuses on behavior and outcomes, not internal mechanics. It also avoids assumptions about execution order and timing.
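The ordering point can be sketched as follows. `active_tags` is a hypothetical function whose result order is not part of its contract; it builds its list from a set, whose iteration order for strings can differ between Python processes.

```python
from collections import Counter
from typing import List


def active_tags() -> List[str]:
    """Hypothetical function under test; result order is not guaranteed."""
    return list({"billing", "search", "checkout"})


def test_active_tags_ignores_order() -> None:
    # Counter compares contents and multiplicities, not positions.
    assert Counter(active_tags()) == Counter(["billing", "checkout", "search"])
```

Comparing against a fixed list such as `["billing", "checkout", "search"]` would pass or fail depending on the run, even though the behavior being tested never changed.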
Infrastructure Factors
CI environment variability can also cause flakiness. Different CI runners may have different CPU speeds, memory, or network conditions. A test that passes on a fast machine may fail on a slower one if it relies on timing. The fix is to avoid timing assumptions altogether, or to use a test harness that controls the environment (e.g., Docker containers with fixed resources).
A Walkthrough: Taming a Flaky Integration Test
Let's walk through a composite scenario drawn from real-world patterns. A team has an integration test that creates an order in an e-commerce system, then checks that the inventory count decreases by one. The test passes most of the time but fails about 5% of the runs. The team initially adds a retry, but the flakiness persists.
Step 1: Reproduce. The team runs the test 20 times locally. It fails twice, both times while the machine is under heavy load. They add logging to capture the inventory count before and after the order creation.
Step 2: Isolate. They run the test on its own, with no other tests running concurrently. It passes consistently, which suggests the flakiness comes from interference by other tests.
Step 3: Diagnose. They examine the test suite and find that another test modifies the same inventory database using the same product ID. When the two run in parallel, one test's inventory update can overwrite or interleave with the other's, so the count no longer decreases by exactly one. The root cause is shared mutable state: the product ID is not unique per test.
Step 4: Decide. The team decides to fix the root cause by generating a unique product ID for each test. They also add a database cleanup step in the teardown to remove test data. This is a relatively low-effort fix that eliminates the flakiness.
Step 5: Monitor. After the fix, the test passes in 100 out of 100 runs. The team also checks for other tests that share state and fixes them proactively.
This scenario illustrates a common pattern: flakiness caused by test interference. The fix was not a retry or a sleep, but a design change—making each test's data unique. The flaky test was a signal that the test data strategy was flawed.
Another Scenario: Async Timeout
Another team has a test that sends a message to a queue and then checks that the message is processed. The test waits 5 seconds for processing. It fails occasionally when the queue is slow. The team's first instinct is to increase the timeout to 10 seconds, but that makes the test slower and still fails under heavy load.
The root cause is a time-dependent assumption. The fix is to use a polling mechanism that checks for the expected outcome (e.g., a database record) with a short interval and a reasonable timeout. This makes the test faster and more reliable. The flaky test signaled that the design relied on a fixed wait, which is brittle.
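A polling helper along these lines might look like the sketch below. `wait_until` is a hypothetical name, and a `threading.Timer` appending to a list stands in for the queue consumer writing its result.

```python
import threading
import time
from typing import Callable, List


def wait_until(condition: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll condition until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout  # monotonic: immune to clock changes
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def test_message_is_processed() -> None:
    processed: List[str] = []
    # Stand-in for the consumer: records the message after a short delay.
    threading.Timer(0.1, lambda: processed.append("msg-1")).start()

    assert wait_until(lambda: "msg-1" in processed, timeout=5.0), \
        "message was not processed within 5 seconds"
```

The test finishes shortly after the message is processed instead of always paying the full 5 seconds, so the timeout can be generous without slowing down the suite.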
Edge Cases and Exceptions
Not all flaky tests are design signals. Some flakiness is genuinely environmental or transient. For example, a network outage during a test run is not a design flaw—it's an infrastructure issue. Similarly, a test that fails due to a bug in a third-party library may be a signal to update the library, not to redesign the system.
Another edge case is non-deterministic production behavior that is intentional. For example, a load balancer that distributes requests randomly may cause a test to see different responses. In this case, the test should be designed to handle any valid response, not a specific one. The flakiness signals that the test is too prescriptive.
There are also cases where flakiness is acceptable. For example, a test that verifies a probabilistic algorithm (e.g., a recommendation engine) may fail occasionally due to randomness. In such cases, the test should be statistical (e.g., check that the result falls within an expected range) rather than exact. But even then, the flakiness should be documented and understood.
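A statistical test of this kind can be sketched as follows. The `recommend_premium` function, the 30% target rate, and the tolerance are illustrative assumptions, not values from any real system.

```python
import random
from typing import Optional


def recommend_premium(rng: random.Random, probability: float = 0.3) -> bool:
    """Hypothetical probabilistic decision: True about 30% of the time."""
    return rng.random() < probability


def observed_rate(rng: random.Random, trials: int = 10_000) -> float:
    hits = sum(recommend_premium(rng) for _ in range(trials))
    return hits / trials


def test_rate_is_within_tolerance(rng: Optional[random.Random] = None) -> None:
    if rng is None:
        rng = random.Random()
    rate = observed_rate(rng)
    # Standard error is sqrt(0.3 * 0.7 / 10_000), about 0.0046, so a tolerance
    # of 0.03 is more than six standard errors from the expected rate.
    assert abs(rate - 0.3) < 0.03, f"observed rate {rate:.4f} outside tolerance"
```

The tolerance is derived from the sample size rather than guessed, which is what makes the remaining failure probability something the team can document and reason about.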
When to Quarantine vs. Fix
Not every flaky test needs immediate fixing. If the root cause is a known issue that is being addressed elsewhere (e.g., a planned refactor of the shared state), it may be more efficient to quarantine the test and fix it later. The decision depends on the severity of the flakiness and the effort required. A test that fails 1% of the time and is easy to fix should be fixed. A test that fails 50% of the time and requires a major redesign may be quarantined until the redesign is done.
The Limits of Retries
Retries are a valid strategy for transient infrastructure issues (e.g., network blips), but they should not be the default response to flakiness. A test that requires retries to pass is not reliable. Moreover, retries can mask real failures. A better approach is to use retries only for known transient conditions and to log the retry count so that teams can monitor the frequency.
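A minimal sketch of a retry that is limited to known transient errors and logs every attempt. `retry_transient` and the logger name are hypothetical; treating `ConnectionError` and `TimeoutError` as the transient set is an assumption to adapt per system.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("flaky.retries")  # hypothetical logger name


def retry_transient(fn: Callable[[], T],
                    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                    attempts: int = 3,
                    delay: float = 0.0) -> T:
    """Call fn, retrying only on the listed transient exception types."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            # Log each retry so the frequency can be monitored over time.
            logger.warning("transient failure on attempt %d/%d: %r",
                           attempt, attempts, exc)
            time.sleep(delay)
    raise RuntimeError("unreachable")
```

Anything outside `retry_on`, including `AssertionError`, propagates on the first attempt, so a real test failure is never retried away.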
Limits of the Approach
Treating flaky tests as design signals has its limits. First, it requires time and expertise to diagnose root causes. Teams under pressure to deliver features may not have the bandwidth to investigate every flaky test. In such cases, a pragmatic approach is to triage flaky tests by impact: fix the ones that block releases, quarantine the ones that are rare, and schedule deeper investigations for later.
Second, not all flakiness is fixable. Some systems are inherently non-deterministic (e.g., distributed systems with eventual consistency). In those cases, tests must be designed to tolerate some uncertainty. For example, instead of checking exact state, check that the system eventually reaches a consistent state within a reasonable time. This requires a shift in test design philosophy.
Third, the approach assumes that the test suite is well-maintained. If the test suite is full of poorly written tests, flakiness may be a symptom of test quality rather than system design. In that case, the first step is to improve test design—make tests independent, focused, and deterministic. Only then can flakiness be reliably interpreted as a design signal.
Finally, there is a risk of over-engineering. Not every flaky test needs a deep investigation. Sometimes the simplest fix—like generating unique data—is sufficient. The key is to know when to dig deeper and when to apply a quick fix. The framework we've outlined helps make that decision.
When to Seek Help
If your team is spending more than 10% of its time dealing with flaky tests, it may be worth investing in a dedicated effort to reduce flakiness. This could involve a flaky test bounty program, a weekly triage meeting, or a tool that automatically quarantines flaky tests. The goal is to make flakiness visible and actionable, not to eliminate it overnight.
Next Steps for Your Team
Start by auditing your test suite for flaky tests. Track them in a spreadsheet or a dashboard. For each flaky test, apply the triage framework: reproduce, isolate, diagnose, decide, monitor. Over time, you'll see patterns that point to design issues. Address those issues systematically. And remember: flaky tests are not just noise—they're signals. Listen to them, and your test suite and your system will be better for it.