Self-learning systems · gitlab-org #21742 · @pedropombeiro

A self-learning loop for flaky tests

Duo distills our development docs into context, shifts flaky-test fixes left, and closes its own knowledge gaps — getting better every cycle.

The problem

Today, every flaky test starts from zero

End-to-end tests flake on timing, selectors, and shared state. We retry, quarantine, and eventually fix them — but the fix lives in one merge request and one engineer's head. Duo and our agents never carry that hard-won context into the next review.

Knowledge stays tribal

How to stabilize a page object or wait correctly lives with reviewers — not where an agent can reach it.

Fixes land too late

Flakiness usually surfaces after merge, in nightly E2E runs — far from the change that introduced it.

Context never accrues

Each fix is one-off. Nothing distills it back into guidance the platform can reuse.

The idea

What if the platform taught itself?

Turn every fix into durable context — then feed it straight back into review.

Step 01 · scheduled CI job · Automated

Distill the docs into context Duo can use

A scheduled pipeline reads docs.gitlab.com/development, extracts our end-to-end testing conventions, and compiles them into artifacts the AI applies on every change.

It captures conventions like:

development docs · docs.gitlab.com/development

distill-agent-context · .gitlab-ci.yml · schedule: {{ distillSchedule }}

Agent skill · {{ skillName }} · SKILL.md

Review instructions
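
The distill step can be pictured with a small sketch. This is not the actual job: the sample document, the `Conventions` heading, the skill name, and the helpers `extract_rules` and `render_skill` are all hypothetical, and the real pipeline may extract conventions in a completely different way. It only shows the shape of the idea, which is to pull rule-like bullets out of a docs page and render them as a skill file an agent can cite.

```ruby
# Hypothetical sketch: names and sample text are illustrative, not the real job.
SAMPLE_DOC = <<~MARKDOWN
  # Testing guide (sample)

  Intro prose that the distiller skips.

  ## Conventions

  - Confirm the new state with a positive matcher first, then check for absence.
  - Wait on an explicit signal rather than a fixed sleep.

  ## History

  - This bullet is outside the conventions section and is ignored.
MARKDOWN

# Collect the bullet items that sit under the given level-2 heading.
def extract_rules(markdown, heading:)
  in_section = false
  markdown.each_line.with_object([]) do |line, rules|
    line = line.strip
    if line.start_with?('## ')
      in_section = (line == "## #{heading}")
    elsif in_section && line.start_with?('- ')
      rules << line.delete_prefix('- ')
    end
  end
end

# Render the rules as a numbered list an agent can cite in review.
def render_skill(name, rules)
  body = rules.each_with_index.map { |rule, i| "#{i + 1}. #{rule}" }
  (["# #{name}", ''] + body).join("\n") + "\n"
end

rules = extract_rules(SAMPLE_DOC, heading: 'Conventions')
skill = render_skill('flaky-test-review (hypothetical)', rules)
puts skill
```

In a scheduled job, the rendered text would be written to the skill file and committed. Numbering the rules is one possible design choice that lets a review comment point back at a specific convention.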

Step 02 · in the merge request · Automated

Fix flakiness before it ever merges

Duo flags it inline — citing the distilled rule.

An agent proposes the fix in the same MR.

Flakiness is prevented, not just retried away.

Content generated by AI should be seen as a starting point and verified before use.

Merge request !240958 · reorder Capybara assertions

spec/features/admin/admin_sees_background_migrations_spec.rb

click_button 'Resume'

− expect(page).not_to have_button 'Resume'

− expect(page).to have_button 'Pause'

+ expect(page).to have_button 'Pause'

+ expect(page).not_to have_button 'Resume'

GitLab Duo Code Review

The negative matcher runs before the page settles, so Capybara polls the old button for the full 30-second timeout. Confirm the new state with a positive matcher first, then check for absence — per our testing guide.

Suggested fix applied by agent
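
The ordering rule in the review comment can be illustrated with a toy model. This is not Capybara and does not reproduce its timing: `FakePage` and `wait_for` are hypothetical stand-ins, and "polls" replace wall-clock waiting. The sketch only shows that once a waiting positive check has synchronized on the new state, the absence check that follows passes on its first poll.

```ruby
# Toy model only: FakePage and wait_for are hypothetical, not Capybara APIs.
class FakePage
  attr_reader :polls

  # The page shows 'Resume' until `settle_after` polls have happened, then 'Pause'.
  def initialize(settle_after:)
    @settle_after = settle_after
    @polls = 0
  end

  # Each call counts as one poll of the page.
  def buttons
    @polls += 1
    @polls > @settle_after ? ['Pause'] : ['Resume']
  end
end

# Poll the block until it returns true or the attempt budget runs out.
# Returns the number of attempts used, or nil if the budget is exhausted.
def wait_for(max_attempts:)
  1.upto(max_attempts) do |attempt|
    return attempt if yield
  end
  nil
end

page = FakePage.new(settle_after: 3)

# Positive check first: keeps polling until the new state is visible.
positive_attempts = wait_for(max_attempts: 10) { page.buttons.include?('Pause') }

# Absence check second: the page has already settled, so one poll is enough.
negative_attempts = wait_for(max_attempts: 10) { !page.buttons.include?('Resume') }

puts "positive check: #{positive_attempts} polls, absence check: #{negative_attempts} poll"
```

In this model all of the waiting is absorbed by the positive check, which is the synchronization point the suggested reordering relies on.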

Steps 03–04 · scheduled CI job · Manually tested · needs automation

Mine the fixes, find the gaps, write the docs

A second scheduled job classifies merged flaky-test fixes by root cause and checks each recurring pattern against the testing docs — opening a merge request wherever guidance is missing.

Waiting on the right signal · testing-rspec.md

Query asserts in :js specs · testing-rspec.md

Timestamp ordering · testing-rspec.md

E2E waits & navigation · testing-e2e.md (new)

Gap identified · Assert UI before reload · most common cause

Documented in · Distill flaky-test fix learnings into testing principles · merge request !241516 (adds a new testing-e2e principle)

Source: agent gap analysis of 64 fixed flaky-test issues, gitlab-org#603519
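
The classify-then-compare step above can be sketched as follows. Everything here is an assumption made for illustration: the keyword rules, the fix titles, and the helper names are invented, and the real analysis was done by an agent rather than by keyword matching. Only the root-cause labels and the file name `testing-rspec.md` come from the breakdown above.

```ruby
# Hypothetical sketch: sample data and keyword rules are illustrative only.
ROOT_CAUSE_KEYWORDS = {
  'Waiting on the right signal' => %w[wait sleep],
  'Timestamp ordering'          => %w[timestamp created_at],
  'Assert UI before reload'     => %w[reload refresh]
}.freeze

# Root causes the testing docs already cover (assumed for the example).
DOCUMENTED = {
  'Waiting on the right signal' => 'testing-rspec.md',
  'Timestamp ordering'          => 'testing-rspec.md'
}.freeze

# Made-up fix titles standing in for merged flaky-test fixes.
FIX_TITLES = [
  'Wait for request to finish before asserting',
  'Replace sleep with explicit wait',
  'Order records by id instead of timestamp',
  'Assert flash message before page reload',
  'Check badge before reload in settings spec'
].freeze

# Return the first root cause whose keywords appear in the title.
def classify(title)
  lowered = title.downcase
  match = ROOT_CAUSE_KEYWORDS.find { |_cause, words| words.any? { |w| lowered.include?(w) } }
  match ? match.first : 'Unclassified'
end

# A gap is a classified pattern that recurs and has no entry in the docs map.
def undocumented_patterns(counts, min_count: 2)
  counts.select do |cause, n|
    n >= min_count && cause != 'Unclassified' && !DOCUMENTED.key?(cause)
  end.keys
end

counts = FIX_TITLES.map { |title| classify(title) }.tally
gaps = undocumented_patterns(counts)

counts.each { |cause, n| puts "#{cause}: #{n} (#{DOCUMENTED.fetch(cause, 'undocumented')})" }
puts "Open a docs merge request for: #{gaps.join(', ')}"
```

With this sample data, the only recurring pattern without a documentation entry is `Assert UI before reload`, so it is the one the job would raise a docs merge request for.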

Why it works

Every fix makes the next review smarter

Self-reinforcing

Each merged fix becomes documented context, distilled into the next skill and review pass. Knowledge compounds instead of resetting.

Shift-left by default

Fixes move from nightly E2E failures into the merge request, where they are cheapest to make and easiest to learn from.

Runs itself

Two scheduled jobs do the distilling and the gap-finding, so there is no manual curation backlog and no doc rot. The gap-finding job has been tested manually and still needs automation.

Close the loop

Pilot on the E2E browser suite, prove the loop closes end to end once, then expand to every flaky-test bucket.

Work item · gitlab-org #21742

Surface · Duo Code Review + agents

First bucket · E2E browser pages
