
Flaky Tests Explained: Causes and Solutions

Flaky tests pass and fail randomly without code changes. Google found 16% of their tests are flaky, costing 2% of total coding time. Here's how to detect, fix, and prevent them.
Author:
Sandeep Mishra
Posted on:
May 12, 2026
Read time:
10 Minutes

A flaky test is a test that passes and fails on the same code without anything changing. You push a PR, the CI pipeline goes red, you re-run it, and it goes green. No code change. No environment change. The test just decided to fail, and then decided not to.

As JetBrains defines it, flaky tests "return both passes and failures despite no changes to code or test itself." The causes range from race conditions and shared state to environment inconsistencies and hardcoded waits. The result is always the same: your team stops trusting the test suite, and real bugs start slipping through.

This isn't a minor annoyance. Google's engineering team found that 16% of their tests exhibit some flaky behavior, and 84% of pass-to-fail transitions in their CI system are caused by flakes, not actual bugs. Atlassian's engineering blog reports that flaky tests consumed over 150,000 developer hours per year in their Jira backend alone. Microsoft Research found that 13% of their CI test failures are flaky.

If your team has more than a few hundred automated tests, you have flaky tests. The question is whether you're managing them or just hitting "re-run."

What flaky tests actually cost your team

The direct cost is developer time. A Google study calculated that flaky tests consume roughly 2% of total coding time across the company. For a team of 50 developers, that's one full-time engineer's annual output lost to investigating tests that aren't actually broken. At $150/hour fully loaded (a conservative number for US-based engineers in 2026), that's $120,000+ per year in wasted salary.

But the indirect cost is worse: trust decay. When developers learn that a red CI build probably means "just another flake," they stop investigating failures entirely. They re-run the pipeline reflexively and merge when it goes green. This is how real regressions slip into production disguised as flakiness. Slack's engineering team documented this exact pattern: their build failure rate hit 57% before they implemented systematic flake management, with most developers assuming every failure was a flake.

The compounding effect looks like this: flaky tests erode trust, eroded trust leads to skipped investigations, skipped investigations lead to escaped bugs, and escaped bugs lead to production incidents that cost 10-100x more to fix than they would have cost to catch in CI.

The 5 root causes of flaky tests

1. Race conditions and async timing

This is the most common cause, responsible for roughly 45% of all test flakiness according to Luo et al.'s foundational taxonomy of flaky test root causes. The test expects an element or value to be available, but it isn't ready yet because an API call hasn't returned, a DOM element hasn't rendered, or a JavaScript event handler hasn't attached.

The classic symptom: a test that clicks "Submit" and immediately checks for a success message. On a fast machine, the API returns in 100ms and the test passes. On a loaded CI runner, the API takes 800ms and the test fails because it checked too early.

javascript
// BAD: hardcoded wait that guesses how long to wait
await page.click('#submit');
await new Promise(r => setTimeout(r, 2000)); // hope 2 seconds is enough
expect(await page.textContent('.banner')).toBe('Success');

// BETTER: wait for actual condition
await page.click('#submit');
await page.waitForSelector('.banner', { timeout: 10000 });
expect(await page.textContent('.banner')).toBe('Success');

The sleep(2000) approach is a confession that you don't know when the event will complete, so you're guessing. When the guess is wrong, the test flakes.

2. Shared state between tests

The second most common cause, and the hardest to diagnose. Test A creates a user record. Test B reads that record. When they run in order, both pass. When the runner parallelizes them or shuffles the order, Test B fails because Test A hasn't run yet.

The fix is test isolation: each test sets up its own data, runs against it, and tears it down. No test should depend on state left behind by another test.
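
Here's a minimal sketch of that pattern, assuming a Jest suite; the createUser/deleteUser helpers in ./test-db and the ./profile module are hypothetical, not from any specific library:

javascript
const { createUser, deleteUser } = require('./test-db'); // hypothetical per-test fixture helpers
const { updateProfile, getProfile } = require('./profile'); // hypothetical module under test

let user;

beforeEach(async () => {
  // Each test creates its own record with a unique email, owned by this test only
  user = await createUser({ email: `user-${Date.now()}@example.com` });
});

afterEach(async () => {
  // Leave nothing behind for the next test to accidentally depend on
  await deleteUser(user.id);
});

test('profile name can be updated', async () => {
  await updateProfile(user.id, { name: 'Ada' });
  const profile = await getProfile(user.id);
  expect(profile.name).toBe('Ada');
});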

3. Environment differences

The test passes on a developer's MacBook but fails on a Linux CI runner. Different OS, different timezone, different locale, different available memory, different CPU speed. Any of these can cause subtle behavior differences. A date formatted as "05/12/2026" on a US-locale machine appears as "12/05/2026" on a UK-locale CI server. The assertion fails.

The fix is environment parity: run tests in the same containerized environment everywhere (Docker, Kubernetes), and never depend on locale, timezone, or filesystem assumptions.
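
Part of that parity can live in the test setup itself. A minimal sketch for a Jest suite, assuming a setup file listed under setupFiles in jest.config.js (setting TZ at runtime works on Linux/macOS runners, not Windows):

javascript
// jest.setup.js -- pin timezone and locale before any test code runs,
// so dates and strings behave the same on a laptop and on a CI runner
process.env.TZ = 'UTC';
process.env.LANG = 'en_US.UTF-8';

// In tests, pass an explicit locale instead of relying on the machine default:
// new Date('2026-05-12').toLocaleDateString('en-US') -> '5/12/2026' everywhere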

4. Test order dependency

Some tests implicitly depend on running after a specific test that sets up state. Shuffling the order breaks them. This is related to shared state but subtler: the dependency might be a database row created by a previous test, a cookie set by a login test, or a file written to disk.

Research from Lam et al. (ICSE 2020) found that 75% of flaky tests are flaky from the moment they're committed. Running new tests in random order N times before merging catches most order-dependent flakes before they enter the main suite.
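
One way to shuffle order deliberately, sketched for Jest (the testSequencer option and @jest/test-sequencer package are Jest-specific; other runners have equivalent hooks):

javascript
// shuffle-sequencer.js -- wire up with `testSequencer: './shuffle-sequencer.js'` in jest.config.js
const Sequencer = require('@jest/test-sequencer').default;

class ShuffleSequencer extends Sequencer {
  sort(tests) {
    // Fisher-Yates shuffle: every run gets a different test-file order,
    // which surfaces order-dependent tests before they reach the main suite
    const shuffled = [...tests];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    return shuffled;
  }
}

module.exports = ShuffleSequencer;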

5. External service dependencies

Tests that call real APIs, real databases, or real third-party services inherit those services' latency and availability. A payment gateway that responds in 200ms during the day might respond in 2 seconds during peak hours. A staging database that's fast when empty slows down after a month of accumulated test data.

The fix is mocking or stubbing external dependencies in tests. Use a local database for each test run, and mock third-party APIs with predictable responses and latency.
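
A minimal sketch of that stubbing, assuming a Jest suite and a hypothetical chargeCustomer() function that calls the payment gateway through global fetch:

javascript
const { chargeCustomer } = require('./payments'); // hypothetical module under test

test('charge succeeds with a stubbed gateway', async () => {
  // Replace the real gateway with a predictable response and zero latency
  global.fetch = jest.fn().mockResolvedValue({
    ok: true,
    json: async () => ({ status: 'succeeded', amount: 4999 }),
  });

  const result = await chargeCustomer({ amount: 4999 });

  expect(result.status).toBe('succeeded');
  expect(global.fetch).toHaveBeenCalledTimes(1); // no real network call was made
});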

How to detect flaky tests in your CI pipeline

You can't fix what you can't find. Here are three detection methods that work at scale.

Historical pass/fail tracking. The simplest approach: store every test result and look for tests that flip between pass and fail without corresponding code changes. Most CI platforms (Jenkins, GitHub Actions, GitLab CI, CircleCI) have plugins or built-in features for this. JetBrains TeamCity detects flaky tests automatically by comparing results across multiple runs.
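
The core check is simple enough to sketch in a few lines. Assume test results are exported from CI as { testName, commitSha, status } records; the shape is illustrative, not any platform's real API:

javascript
// A test that both passed and failed on the same commit is flaky by definition
function findFlakyTests(results) {
  const statusesByKey = new Map();
  for (const { testName, commitSha, status } of results) {
    const key = `${testName}@@${commitSha}`;
    if (!statusesByKey.has(key)) statusesByKey.set(key, new Set());
    statusesByKey.get(key).add(status);
  }

  const flaky = new Set();
  for (const [key, statuses] of statusesByKey) {
    if (statuses.has('pass') && statuses.has('fail')) {
      flaky.add(key.split('@@')[0]);
    }
  }
  return [...flaky];
}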

N-run validation for new tests. Run every new test 10+ times before merging it into the main suite. Google uses a system called "Reservoir" that runs new tests in a loop for a week before they become part of the CI path. This catches 75% of flakes before they ever reach production CI, according to Lam et al.'s research.
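
A minimal sketch of the same idea at a smaller scale, assuming Jest's --testPathPattern flag to target one new test file (swap in whatever filter your runner uses):

javascript
const { execSync } = require('child_process');

// Run a new test file N times against unchanged code; any failure means it's flaky
function validateNewTest(testPath, runs = 10) {
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      execSync(`npx jest --testPathPattern=${testPath}`, { stdio: 'ignore' });
    } catch {
      failures++; // a non-zero exit code counts as a failed run
    }
  }
  return { runs, failures, flaky: failures > 0 };
}

console.log(validateNewTest(process.argv[2]));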

Quarantine with expiry. When you identify a flaky test, move it to a quarantine group so it doesn't block the main build while you investigate. But set a hard expiry: one week maximum. Without a deadline and an assigned owner, quarantine becomes a permanent graveyard of ignored tests. Slack learned this the hard way and switched from hiding failures to disabling tests with tracking tickets, which drove their build failure rate from 57% down to under 4%.
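
A quarantine wrapper with a built-in deadline might look like this (a sketch for Jest; the quarantine() helper is not part of any framework):

javascript
// Skips a known-flaky test until its expiry date, then fails loudly so
// quarantine can't become a permanent graveyard
function quarantine(name, expiresOn, fn) {
  if (new Date() > new Date(expiresOn)) {
    test(name, () => {
      throw new Error(`Quarantine for "${name}" expired on ${expiresOn}: fix the test or delete it.`);
    });
  } else {
    test.skip(name, fn);
  }
}

// Usage: quarantine('checkout total updates after coupon', '2026-05-19', async () => { /* ... */ });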

Why flaky tests are worse on mobile

Everything above applies to web and backend testing. Mobile testing adds extra flakiness sources that web tests don't have.

Device hardware variability. The same test runs on a flagship phone with 12GB RAM and a budget phone with 3GB RAM. On the flagship, the app responds in 100ms. On the budget phone, it responds in 600ms. A hardcoded wait that works on the flagship fails on the budget device. Multiply this across 20+ device/OS combinations and flakiness compounds fast.

OEM-specific rendering. Samsung's One UI, Xiaomi's MIUI, and stock Android all render the same app slightly differently. Elements shift position. System fonts change. Status bar height varies. A pixel-precise assertion or a coordinate-based tap that works on a Pixel fails on a Samsung.

Unpredictable system popups. Permission dialogs, "rate this app" prompts, battery optimization warnings, app-update banners. These appear at different times on different devices and block test execution. A test that passes 9 out of 10 times might fail on the 10th because a system popup appeared that the test script didn't account for.

Emulator vs real device gaps. Tests that pass consistently on an emulator flake on real hardware because emulators don't accurately simulate GPU rendering, thermal throttling, battery state, or network conditions. A test that runs fine on an x86 emulator might flake on an ARM device because of architecture-specific timing differences.

In our experience with mobile engineering teams, the baseline flakiness rate for traditional Appium-based mobile test suites runs around 15% (roughly 1 in 7 tests fail randomly on any given run). Teams we've worked with have cut that to under 5% by switching to an approach that removes the two biggest flakiness sources: element selectors and hardcoded waits.

How to reduce flaky tests on mobile to under 5%

The structural fix for mobile flakiness isn't "write better selectors" or "add longer waits." It's removing the layers that cause flakiness in the first place.

Drizz takes this approach. Tests are written in plain English ("Tap Login, enter email, tap Submit, validate home screen"), and a Vision AI engine executes them by reading the screen visually. There are no element selectors to break. The Vision AI uses adaptive wait logic powered by screen state detection instead of static timers, so it waits for the actual UI to change instead of guessing how long to sleep. And a built-in popup agent runs in the background during every test, automatically dismissing permission dialogs, update prompts, and other system interruptions before they block the next step.

This runs on real Android and iOS devices, not emulators. That eliminates the emulator-vs-real-device gap entirely. The same test runs across multiple devices and OS versions without device-specific scripts, and tests self-heal when the UI changes because they were never anchored to a selector in the first place.

One team we work with went from spending 30% of sprint time on testing and triage to about 10% after switching. Another authored 20 new tests in a single day, something that would have taken weeks with scripted Appium automation. The flakiness rate dropped from roughly 15% to under 5%, which meant the CI pipeline became trustworthy again and developers stopped reflexively re-running builds.

Flaky tests are a systemic problem, not a collection of individual bugs. You can chase them one by one with better waits and cleaner state management, and that'll help. But if your test foundation depends on brittle selectors and static timers running on emulators, you're playing whack-a-mole. The teams that got flakiness under control are the ones that changed the foundation.

FAQ

What are flaky tests?

Flaky tests are automated tests that pass and fail inconsistently without any changes to the code being tested. The same test, run against the same code, on the same branch, produces different results on different runs. JetBrains defines them as tests that "return both passes and failures despite no changes to code or test itself."

What causes flaky tests?

The five most common causes are race conditions and async timing (the test checks for a result before it's ready), shared state between tests (one test depends on data from another), environment differences (the test passes locally but fails on CI), test order dependency (shuffling run order breaks tests), and external service dependencies (real APIs introduce unpredictable latency).

How much do flaky tests cost?

Google calculated that flaky tests consume about 2% of total developer coding time. For a 50-person team, that's roughly one full-time engineer's annual output. Atlassian reported over 150,000 developer hours per year consumed by flaky test investigation in their Jira backend. The indirect costs (eroded CI trust, escaped production bugs, delayed releases) are harder to quantify but often larger.

How do you detect flaky tests?

Track test results over time and flag tests that flip between pass and fail without code changes. Run new tests 10+ times before merging them into the main suite. Quarantine known flaky tests so they don't block the main build, but set a hard expiry (one week max) with an assigned owner. Most CI platforms have built-in or plugin-based flaky test detection.

Are flaky tests worse on mobile?

Yes. Mobile testing adds device hardware variability, OEM-specific rendering differences, unpredictable system popups, and emulator-vs-real-device gaps that web testing doesn't have. Traditional mobile test suites (Appium, Espresso, XCUITest) typically run at 85-90% reliability. Vision AI approaches that eliminate selectors and use adaptive waits can push that to 95%+.

What is a good flaky test rate?

Under 1% is the target for mature teams. Google's engineering team works to keep their flake rate below 1.5% of total test executions. Most teams without active flake management run at 5-15%. For mobile specifically, under 5% is a realistic target with modern tooling, and under 2% is achievable with Vision AI approaches on real devices.

About the Author:

Sandeep Mishra
Drizz's resident no-nonsense voice on Appium-vs-everything-else, the kind of takes you read twice and forward once.