Every engineering team adopting AI tools asks the same question: "Is it working?" Not "is the AI good?" but "is it actually making us better?"
Most teams answer this with vibes. "It feels faster." "Code reviews seem easier." "I think we're shipping more." These are not measurements. They're impressions, and impressions lie.
At Horizon, we use a single metric that tells us exactly how AI-ready our codebase is: the percentage of reported bugs that can be resolved with a single prompt.
The Metric
The idea is simple. When a bug is reported, you give it to an AI coding agent with one prompt: the bug description, the relevant error, and the app it affects. No additional context. No back-and-forth. No "try this file" or "check that service."
One prompt. Does the AI fix the bug correctly?
If yes, that area of your codebase is AI-ready. The documentation is clear, the architecture is unambiguous, and the model has everything it needs.
If no, something is missing. Maybe the app's AGENTS.md doesn't describe the pattern well enough. Maybe the relevant files are too large to fit in context. Maybe there's an implicit convention that isn't written anywhere. Every failure is a signal about your documentation or code structure.
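The experiment itself is mostly prompt discipline: everything the model gets must fit in that first message. Here is a minimal sketch of assembling it; the format and field names are hypothetical, not a prescribed template.

```python
def build_one_prompt(bug_description, error, app):
    """Assemble the single prompt: the bug description, the relevant
    error, and the app it affects. No file hints, no extra context,
    by design -- that is what makes the result a measurement."""
    return (
        f"App: {app}\n"
        f"Bug report: {bug_description}\n"
        f"Error: {error}\n"
        "Fix this bug."
    )

# Illustrative usage with the example bug from this post:
prompt = build_one_prompt(
    "Null pointer exception when a user has no conversations",
    "TypeError: conversations is undefined",
    "messaging",
)
```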
Why Bugs Specifically
We chose bugs over features for a specific reason. Bugs are concrete. They have a clear before (broken) and after (fixed). There's no ambiguity about whether the output is correct. Either the bug is fixed and tests pass, or it isn't.
Features are harder to evaluate. "Build a dashboard" has many valid interpretations. "Fix the null pointer exception when a user has no conversations" has exactly one correct outcome.
Bugs also span the entire codebase naturally. Over time, bugs get reported in every app, every service, every layer. This gives you coverage without having to design test cases.
How We Track It
Every bug that comes through our ticketing system gets flagged. Before an engineer works on it, we run a quick experiment: hand the bug to the AI agent with one prompt. Record the outcome.
We track three categories:
Resolved in one prompt. The AI produced a correct fix with no human intervention beyond the initial prompt. This is the success case.
Resolved with guidance. The AI needed additional context, clarification about which file to look at, or correction after an initial wrong attempt. This means the documentation layer has gaps.
Not resolved by AI. The bug required human reasoning that the AI couldn't perform, either because the problem was too complex or because the codebase lacked the structure to guide the model.
We calculate the one-prompt resolution rate per app, per month. This gives us a heatmap of AI-readiness across the codebase.
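The bookkeeping behind that heatmap is minimal. A sketch in Python, assuming each experiment is logged as an (app, month, outcome) tuple; the outcome labels are illustrative, not from any real ticketing system:

```python
from collections import defaultdict

# One record per experiment. Outcome is one of "one_prompt",
# "with_guidance", or "not_resolved" (the three categories above).
records = [
    ("billing", "2024-05", "one_prompt"),
    ("billing", "2024-05", "with_guidance"),
    ("search", "2024-05", "one_prompt"),
    ("search", "2024-05", "one_prompt"),
    ("search", "2024-05", "not_resolved"),
]

def one_prompt_rates(records):
    """Return {(app, month): one-prompt resolution rate}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for app, month, outcome in records:
        totals[(app, month)] += 1
        if outcome == "one_prompt":
            successes[(app, month)] += 1
    return {key: successes[key] / totals[key] for key in totals}

rates = one_prompt_rates(records)
# billing 2024-05: 1 of 2 resolved in one prompt; search 2024-05: 2 of 3.
```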
What the Numbers Tell You
When we started tracking, our one-prompt resolution rate was around 20%. Most bugs needed at least one round of human guidance.
After restructuring our codebase into smaller apps, writing AGENTS.md files, and building out agent_docs, we saw the rate climb steadily. Apps with comprehensive documentation hit 60%+ resolution rates. Apps we hadn't documented yet stayed at 20%.
The correlation was obvious and immediate. Documentation quality directly predicts AI performance. Not model choice. Not tool choice. Documentation.
How We Use This to Prioritize
The metric isn't just a score. It's a decision-making tool.
When the one-prompt rate for an app drops, we investigate. Usually we find that a new pattern was introduced without updating the docs, or a refactor changed the architecture without updating AGENTS.md. The fix is always the same: update the documentation layer.
When we're deciding which app to invest in next for AI-readiness, we look at two things: which app has the most bugs? And which app has the lowest resolution rate? The intersection tells us where the highest ROI improvement lives.
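That intersection can be computed directly: weight each app's bug volume by the share of bugs that still need human guidance. A sketch with made-up numbers:

```python
# Hypothetical per-app stats: monthly bug count and one-prompt rate.
apps = {
    "billing": {"bugs": 40, "rate": 0.25},
    "search": {"bugs": 12, "rate": 0.30},
    "auth": {"bugs": 35, "rate": 0.70},
}

def roi_ranking(apps):
    """Rank apps by expected bugs per month that currently need a
    human: bug volume times the share NOT resolved in one prompt."""
    return sorted(
        apps,
        key=lambda name: apps[name]["bugs"] * (1 - apps[name]["rate"]),
        reverse=True,
    )

print(roi_ranking(apps))  # → ['billing', 'auth', 'search']
# billing leads: 40 bugs * 0.75 unresolved share = 30 guided fixes/month.
```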
We also use it during architecture reviews. When proposing a new pattern or refactoring an area, we ask: "Will this make the code more or less AI-readable?" If a proposed change makes things harder for AI to understand, we need a strong justification to proceed.
Getting Started
If you want to try this with your team, start small. Pick ten recent bugs across your codebase. Feed each one to your AI coding tool with just the bug description, the error, and the app it affects. Track how many get resolved correctly.
That number is your baseline. Then start improving your documentation, restructuring ambiguous areas, and measuring again. The improvement is usually fast and visible.
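A baseline run can be tallied in a few lines; the bug IDs and outcomes here are invented for illustration.

```python
# Ten recent bugs, each tried once with a single prompt.
# True means the AI's fix was correct with no follow-up.
trials = {
    "BUG-101": True,  "BUG-102": False, "BUG-103": True,
    "BUG-104": False, "BUG-105": False, "BUG-106": True,
    "BUG-107": False, "BUG-108": False, "BUG-109": True,
    "BUG-110": False,
}

baseline = sum(trials.values()) / len(trials)
print(f"One-prompt resolution rate: {baseline:.0%}")  # prints "One-prompt resolution rate: 40%"
```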
The beauty of this metric is that it's cheap to collect, hard to game, and directly actionable. A low score always points to a specific, fixable problem in your codebase or documentation. A high score means your investment in AI-readiness is paying off.
No vibes required.