Test Reliability · 9 min read · April 28, 2026 · Updated June 9, 2026

What Are Flaky Tests? Causes, Costs, and How to Fix Them

A flaky test passes and fails on the same code without any change. Flakiness erodes trust in your whole suite and trains teams to ignore failures. Here's what causes it and how to eliminate it.

Written by Jordan Blake · Senior QA Engineer

TL;DR

A flaky test produces different results on the same code — sometimes passing, sometimes failing. The usual culprits are timing races, hardcoded waits, shared state between tests, and brittle selectors. The fix is deterministic tests: auto-waiting, isolated state, semantic selectors, and investigating every failure.

A flaky test is one that passes and fails on the exact same code, with no changes in between. Run it once and it is green; run it again and it is red. Because the result is non-deterministic, a flaky test tells you nothing reliable about whether your app actually works — and that is precisely what makes it so damaging.

Why flaky tests are worse than they sound

The real cost of flakiness is not the wasted minutes re-running a build. It is the slow erosion of trust. When failures are sometimes meaningless, developers learn to shrug them off with "just hit retry." Eventually a real failure scrolls by, gets the same shrug, and a genuine bug ships to production. A test suite you cannot trust is worse than no suite at all, because it provides false confidence. We dig into this dynamic in our article on why flaky tests destroy developer trust.

The common causes

1. Timing and race conditions

By far the most frequent cause. The test tries to interact with an element before the page has finished rendering it, or before data has loaded. Sometimes the page is fast enough and the test passes; sometimes it is not and the test fails.

2. Hardcoded waits

Teams often try to fix timing issues by adding a fixed pause — "wait 3 seconds." This is fragile: too short and it still fails on a slow run, too long and it wastes time on every run. The right fix is to wait for a condition (the element is visible, the network request finished), not a fixed duration.
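As a framework-agnostic illustration of waiting for a condition rather than a duration, here is a minimal TypeScript sketch of a polling helper. The name waitForCondition and its default timeout and interval are assumptions made for this example, not part of any library; real test frameworks ship more robust equivalents.

```typescript
// Hypothetical helper: poll a condition until it holds or a deadline passes.
async function waitForCondition(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // Return as soon as the condition holds, so fast runs lose no time.
    if (await condition()) return;
    // Fail loudly once the deadline passes, so slow runs get a clear error.
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A test would call something like await waitForCondition(() => order.status === 'paid') instead of sleeping for three seconds: a fast run continues immediately, and a slow run gets the full timeout before it fails.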

3. Shared state between tests

When tests reuse the same user account, database records, or browser session, one test can leave behind state that breaks the next — and the failure depends on the order they happen to run in. Each test should start from a clean, isolated state.
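One way to get that isolation is to have every test build its own data instead of importing a shared fixture. The sketch below is only an illustration: TestUser, createTestUser, and the two sample tests are hypothetical names, and a real suite would create the user through its own API or database helpers.

```typescript
interface TestUser {
  email: string;
  cart: string[];
}

let userCounter = 0;

// Hypothetical factory: every call returns a brand-new user with a unique
// email, so no two tests can collide on the same account.
function createTestUser(): TestUser {
  userCounter += 1;
  return { email: `user-${Date.now()}-${userCounter}@example.test`, cart: [] };
}

// Each "test" starts from its own fresh user.
function testAddToCart(): number {
  const user = createTestUser();
  user.cart.push("book");
  return user.cart.length;
}

function testCartStartsEmpty(): number {
  const user = createTestUser();
  return user.cart.length;
}
```

Because neither test touches the other's user, the results are the same whichever one runs first. With a single shared user, testCartStartsEmpty would fail whenever it happened to run after testAddToCart.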

4. Brittle selectors

If a test finds a button by a CSS class tied to styling, a routine design tweak can break it even though the feature still works. Finding elements by their visible role and text is far more resilient.
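The difference can be shown with a deliberately tiny model of a page. This is a sketch, not a real DOM or any framework's API: UiElement, findByClass, and findByRole are names invented for the example.

```typescript
// A minimal stand-in for a rendered element.
interface UiElement {
  role: string;
  label: string; // the visible text a user would read
  className: string;
}

// Brittle lookup: depends on a class that exists only for styling.
function findByClass(elements: UiElement[], cls: string): UiElement | undefined {
  return elements.find((el) => el.className.split(" ").some((c) => c === cls));
}

// Resilient lookup: depends on what the user actually perceives.
function findByRole(elements: UiElement[], role: string, label: string): UiElement | undefined {
  return elements.find((el) => el.role === role && el.label === label);
}

// The same button before and after a routine restyle.
const beforeRedesign: UiElement[] = [{ role: "button", label: "Submit", className: "btn btn-blue" }];
const afterRedesign: UiElement[] = [{ role: "button", label: "Submit", className: "btn btn-primary" }];

const classLookup = findByClass(afterRedesign, "btn-blue"); // undefined: the selector broke
const roleLookup = findByRole(afterRedesign, "button", "Submit"); // still finds the button
```

The feature did not change between the two versions, yet only the class-based lookup stops working.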

5. Uncontrolled external dependencies

Tests that depend on real third-party services, live networks, or the current date and time will fail whenever those things hiccup — through no fault of your app.
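For the date-and-time case, a common approach is to pass the clock in as a parameter so a test can pin it. The sketch below assumes a made-up business rule (isTrialExpired) purely for illustration; the Clock type and the fixed date are likewise example choices.

```typescript
// A clock is just a function that returns "now". Production code uses the
// real one; tests pass in a fixed one.
type Clock = () => Date;

const systemClock: Clock = () => new Date();

// Hypothetical rule used only for illustration.
function isTrialExpired(trialEndsAt: Date, now: Clock = systemClock): boolean {
  return now().getTime() > trialEndsAt.getTime();
}

// In a test: pin "now" so the result never depends on when the suite runs.
const fixedClock: Clock = () => new Date("2026-01-15T12:00:00Z");

const expired = isTrialExpired(new Date("2026-01-14T00:00:00Z"), fixedClock); // true
const active = isTrialExpired(new Date("2026-01-16T00:00:00Z"), fixedClock); // false
```

With the real clock, a test written against a hardcoded trial date would pass today and start failing on some future run with no code change, which is exactly the definition of flaky.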

Flakiness vs. a genuine failure

Not every failing test is flaky. Knowing the difference saves you from ignoring a real regression while hunting a phantom.

Flaky. The test fails on unchanged code, then passes on a re-run. The result is non-deterministic. The cause is inside the test or its environment.
Genuine regression. The test fails consistently after a specific code change. It passes on the previous commit and fails on the next. The cause is in the application.
Infrastructure failure. The test fails because CI ran out of memory, a network request timed out, or the test environment was unavailable. The result is consistent but the cause is outside both the test and the application.

The fastest way to distinguish them: run the test at least three times on the same commit without any code change. If the results differ between runs, it is flaky; a rare flake may need more runs before it shows itself. If it fails every time, check whether the failure reproduces locally. If it does, treat it as a genuine regression. If it only fails in CI, suspect infrastructure.
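The re-run check is simple enough to script. The following is a minimal sketch with invented names (rerun, classifyRuns, Verdict); it only separates consistent from inconsistent results and does not try to detect infrastructure problems.

```typescript
type Verdict = "stable-pass" | "flaky" | "consistent-failure";

// Run the same test function several times and record pass (true) or fail (false).
function rerun(testFn: () => void, times: number): boolean[] {
  const results: boolean[] = [];
  for (let i = 0; i < times; i++) {
    try {
      testFn();
      results.push(true);
    } catch (error) {
      results.push(false);
    }
  }
  return results;
}

// Mixed results on identical code mean the test is flaky.
function classifyRuns(results: boolean[]): Verdict {
  if (results.length === 0) throw new Error("Need at least one run to classify");
  const passes = results.filter((passed) => passed).length;
  if (passes === results.length) return "stable-pass";
  if (passes === 0) return "consistent-failure";
  return "flaky";
}
```

A "stable-pass" from a handful of runs is evidence rather than proof: a test that flakes once in fifty runs will usually look stable after three.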

How to fix flaky tests

The unifying principle is determinism: the same input should always produce the same result. In practice:

Wait for conditions, not clocks. Replace fixed sleeps with waits for the actual state you need. Frameworks like Playwright do this automatically through auto-waiting.
Isolate every test. Give each test its own data and a fresh session so nothing leaks between runs.
Use semantic selectors. Target elements by role and visible text rather than fragile styling hooks.
Control external dependencies. Mock or stub third-party services and pin time-dependent values so runs are repeatable.
Investigate every failure. Never blindly retry. A trace or video of the failing run usually reveals the root cause in minutes.
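To make the "control external dependencies" item concrete, here is a minimal sketch of stubbing a third-party service through dependency injection. RateProvider, convert, and the 1.25 rate are all hypothetical, and the interface is kept synchronous for brevity even though a real network client would be asynchronous.

```typescript
// The code under test depends on an interface, not on a live service.
interface RateProvider {
  getRate(from: string, to: string): number;
}

// Hypothetical function under test: converts an amount, rounded to cents.
function convert(amount: number, from: string, to: string, rates: RateProvider): number {
  return Math.round(amount * rates.getRate(from, to) * 100) / 100;
}

// In a test: a stub that always answers the same way, with no network involved.
const stubRates: RateProvider = {
  getRate: () => 1.25,
};

const converted = convert(10, "USD", "EUR", stubRates); // 12.5
```

The test now checks the conversion logic alone. Whether the real rate service is slow, down, or returning a different rate today no longer affects the result.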

The most common fix — waiting for a condition instead of a fixed duration — is a small change in code but a large one in reliability:

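// Assumes a Playwright test body: `page` is the test's page fixture and
// `expect` is imported from '@playwright/test'.

// Fragile: a fixed pause wastes time on fast runs and still races on slow ones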
await page.waitForTimeout(3000)
await page.click('#submit')

// Reliable: wait for the real condition, then act
await expect(page.getByText('Loaded')).toBeVisible()
await page.getByRole('button', { name: 'Submit' }).click()

A word on automatic retries

Many teams configure tests to retry automatically on failure. Used carefully, retries can smooth over rare infrastructure blips. Used as a crutch, they hide flakiness instead of fixing it — the test still flakes, you just stop seeing it, and the underlying race condition remains. Treat a retry as a signal to investigate, not a solution.
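One way to keep retries honest is to record when a test only passed on a retry, so the flake shows up in a report instead of vanishing. The sketch below uses invented names (runWithRetries, RetryOutcome) and is not how any particular test runner implements retries.

```typescript
interface RetryOutcome {
  passed: boolean;
  attempts: number;
  flaky: boolean; // true when the test failed at least once before passing
  lastError?: string;
}

// Run a test up to maxRetries + 1 times, but keep the evidence of earlier failures.
function runWithRetries(testFn: () => void, maxRetries: number): RetryOutcome {
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      testFn();
      // Passing on a retry is flagged rather than reported as a clean pass.
      return { passed: true, attempts: attempt, flaky: attempt > 1, lastError };
    } catch (error) {
      lastError = error instanceof Error ? error.message : String(error);
    }
  }
  return { passed: false, attempts: maxRetries + 1, flaky: false, lastError };
}
```

A build can stay green when flaky is true, but each flagged result should land in a queue that someone actually investigates, with lastError as the starting point.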

Reliability is a practice, not a one-time fix

Flakiness creeps back in as an app grows, so keeping a suite stable is ongoing work: consistent patterns, disciplined isolation, and a habit of investigating every red build. This is one of the hardest parts of testing to sustain in-house, and a major reason teams turn to managed QA.

QA Guardian builds tests that are deterministic by design — auto-waiting, isolated state, semantic locators — and a human investigates every failure, so a red result always means something real. Book a demo to see what a zero-flake suite looks like on your app.

Frequently asked questions

What causes flaky tests?

The most common causes are timing issues and race conditions, hardcoded sleeps instead of waiting for conditions, shared state that leaks between tests, brittle selectors tied to styling, and dependencies on real networks or third-party services.

How do you fix a flaky test?

Replace hardcoded waits with condition-based waiting, isolate each test's state, use stable semantic selectors, mock or control external dependencies, and treat every failure as a real signal to investigate rather than re-running until it passes.

How do you know if a test is flaky or pointing to a real bug?

Run the test in isolation several times without any code change. If it sometimes passes and sometimes fails on identical code, it is flaky. If it consistently fails only after a specific code change, it is pointing at a real regression. Playwright traces help you inspect exactly what the page looked like at the moment of failure.

Is it acceptable to retry flaky tests in CI?

Retries are a short-term painkiller, not a cure. A test that needs retrying to pass is masking a real instability in your suite. Use retries sparingly to avoid blocking deploys while you investigate, but always treat a retried failure as a bug to fix rather than noise to suppress.

Can flaky tests be eliminated completely?

In practice you can get very close to zero. Most flakiness comes from a small set of root causes — timing races, shared state, brittle selectors, and uncontrolled dependencies — each with a concrete fix. Flakiness tends to creep back as an app grows, so staying flake-free is an ongoing discipline rather than a one-time fix, but a deterministic, well-isolated suite can run reliably for long stretches.
