How to measure enterprise AI-training ROI without faking it

Every Head of Transformation eventually gets the question from the board: what's the return on all this AI training? And too often the answer that comes back — from the L&D function or, worse, the training vendor — is a number that sounds like ROI but isn't. "Satisfaction: 94%." "Completion: 88%." "Confidence rose 18 points." These are real numbers. None of them is a return.

A return is a change in the business: a process that got faster, an error rate that fell, a cost that came down, a revenue line that moved. "Confidence rose 18 points" is a survey response. It tells you people feel readier. It does not tell you anything got better. The gap between those two sentences is where most AI-training budgets quietly go to die.

Why measurement theatre is the default

Measuring the return on capability is genuinely hard, and that difficulty is the root of the problem. The effect is lagged — behaviour changes months after the workshop. It's multi-causal — was it the training, the new tool, or the reorg? And it's diffuse — spread across dozens of small workflow improvements that never roll up into one clean line. Faced with something that hard to measure, the path of least resistance is to measure something easy instead. And reaction is the easiest thing in the world to measure.

That's the trap Donald Kirkpatrick's model, first published in 1959, makes visible. His four levels of evaluation describe a ladder from the easy and weakly predictive to the hard and valuable.

The four levels of training evidence

Level 4, Results: Did the business move? The metric you agreed up front.

Level 3, Behaviour: Did the work change? Applied evidence, real usage.

(Value lives above this line; theatre lives below.)

Level 2, Learning: Did they pass? Test scores, self-reported confidence.

Level 1, Reaction: Did they like it? Satisfaction scores, completion rates.

After Kirkpatrick (1959); ROI added as a fifth level by Phillips.

Reaction and Learning sit at the bottom: cheap to collect, and weakly predictive of anything that matters. Behaviour and Results sit at the top, where value actually lives — and they're the levels almost everyone skips, because they're the levels that are hard. Jack Phillips later added a fifth, ROI proper, along with the discipline that matters most: isolating how much of the result the programme actually caused.

The reason this bites harder with AI than with most training: McKinsey's State of AI surveys have repeatedly found that while AI adoption is now widespread, only a minority of organisations can tie it to meaningful bottom-line impact. If the enterprise can't attribute value to its AI investment in general, a vendor waving a satisfaction score isn't closing that gap — it's papering over it.

How to measure it without faking it

The honest version isn't more complicated. It's just more disciplined — three moves.

Agree the business metric before the programme starts. Pick one or two things the capability is meant to move — cycle time on a process, an error or rework rate, time-to-proficiency, a deflection rate — and baseline them now. A return you only define afterwards is a story you tell, not a number you measured.

Measure behaviour, not reactions. Instrument whether the work actually changed: are people using the technique on real tasks, how often, with what applied output? Behaviour (Level 3) is the leading indicator that Results (Level 4) will follow — and it's the level where you can catch a failing programme in time to fix it, instead of waiting a year for a disappointing result.

Be honest about attribution. You will rarely get a clean, fully-isolated ROI figure, and anyone who hands you one to two decimal places is selling. Report the movement in the business metric, state plainly what else could have caused it, and use a defensible estimate of the programme's contribution. A credible range beats a precise fiction every time.
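
To make the attribution step concrete, here is a minimal sketch of reporting a range instead of a single figure. It uses the standard ROI calculation (net programme benefit divided by programme cost, times 100) and applies a low and a high estimate of the share of the improvement the programme caused. The function name, the parameter names, and every number in the example are illustrative assumptions, not figures from a real programme.

```python
from dataclasses import dataclass


@dataclass
class RoiRange:
    low_pct: float
    high_pct: float


def roi_range(
    baseline_annual_cost: float,
    current_annual_cost: float,
    programme_cost: float,
    attribution_low: float,
    attribution_high: float,
) -> RoiRange:
    """Return ROI (%) under a low and a high attribution estimate.

    attribution_low / attribution_high are the assumed shares (0..1) of the
    measured improvement that the programme itself caused.
    """
    if not 0.0 <= attribution_low <= attribution_high <= 1.0:
        raise ValueError("attribution shares must satisfy 0 <= low <= high <= 1")
    if programme_cost <= 0:
        raise ValueError("programme_cost must be positive")

    # Movement in the business metric agreed before the programme started.
    gross_benefit = baseline_annual_cost - current_annual_cost

    def roi(share: float) -> float:
        attributed_benefit = gross_benefit * share
        return (attributed_benefit - programme_cost) / programme_cost * 100.0

    # Sort so the range reads low-to-high even if the metric got worse.
    low, high = sorted((roi(attribution_low), roi(attribution_high)))
    return RoiRange(low_pct=low, high_pct=high)


# Hypothetical inputs: process cost fell from 500k to 380k, the programme
# cost 40k, and between a quarter and a half of the gain is credited to it.
example = roi_range(500_000, 380_000, 40_000, 0.25, 0.5)
print(f"ROI between {example.low_pct:.0f}% and {example.high_pct:.0f}%")
# prints "ROI between -25% and 50%"
```

With these made-up inputs the range straddles zero, which is exactly what a single headline figure would hide: the board sees that the result depends on how much of the movement you are willing to credit to the programme.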

What a Head of Transformation can do this quarter

Refuse Level-1 metrics as ROI. Satisfaction and completion are hygiene checks — useful for spotting a broken programme, useless as a return. Don't let them into the board pack dressed as value.
Baseline before you train. Choose the business metric and capture today's number first, while you still can.
Instrument behaviour. Track usage and applied evidence on real work, not attendance and certificates.
Demand ranges, not miracles. A partner who reports a confidence interval is more trustworthy than one who reports a flawless figure.
Tie the next wave to the evidence. If behaviour isn't moving, change the design now — don't wait for the lagging result to confirm what the leading indicator already told you.
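
As an illustration of instrumenting behaviour wave by wave, the sketch below computes the share of each cohort that applied the technique on real work at least once. The event format, the wave labels, and the sample data are assumptions made for the example; a real programme would draw these from whatever usage evidence it has agreed to collect.

```python
from typing import Iterable


def adoption_by_wave(
    cohort_sizes: dict[str, int],
    usage_events: Iterable[tuple[str, str]],
) -> dict[str, float]:
    """Share of each wave's cohort with at least one applied use on real work.

    usage_events holds one (wave, person_id) pair per recorded use.
    Repeat uses by the same person count once.
    """
    active: dict[str, set[str]] = {}
    for wave, person_id in usage_events:
        active.setdefault(wave, set()).add(person_id)

    return {
        wave: len(active.get(wave, set())) / size
        for wave, size in cohort_sizes.items()
        if size > 0  # skip empty cohorts rather than divide by zero
    }


# Hypothetical data: two waves, four recorded uses.
cohorts = {"wave-1": 4, "wave-2": 5}
events = [
    ("wave-1", "a"),
    ("wave-1", "a"),  # repeat use by the same person
    ("wave-1", "b"),
    ("wave-2", "c"),
]
rates = adoption_by_wave(cohorts, events)
print(rates)  # prints {'wave-1': 0.5, 'wave-2': 0.2}
```

A falling rate from one wave to the next is the kind of leading signal that justifies changing the design before the business metric has had time to move.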

The uncomfortable truth is that real AI-training ROI is harder to produce, and less flattering, than the theatre. But it's the only version a board can act on. A confident-sounding percentage that measures nothing is worse than an honest range that measures something — because the first one quietly costs you the next decision.

That's why ASTRA Academy reports adoption, not completion: the platform measures whether the work changed, wave by wave, and every insight is reported at sponsor and cohort level, never as a vanity score.

Sources: Donald Kirkpatrick — Four Levels of Training Evaluation (1959) · Jack Phillips — ROI Methodology (Level 5, isolating programme effect) · McKinsey — The State of AI (annual survey series)
