Measuring ROI on enterprise AI training: the metrics that matter and the ones that don't

Ijan Kruizinga

The metrics that don't matter (but everyone reports anyway)

Let's name the offenders.

Completion rates. A 95% completion rate on a self-paced AI module tells you people clicked through the slides. It does not tell you they can write a useful prompt, structure a Copilot workflow, or spot a hallucinated citation in a generated report. Completion is a hygiene metric. Treat it like one.

Satisfaction scores. Smile sheets are correlated with how good the trainer's jokes were, not with whether anyone changed how they work. Kirkpatrick called this Level 1 in 1959 and the industry has been stuck there ever since. The Association for Talent Development's research on training evaluation has shown for years that Level 1 reactions are a poor predictor of behaviour change.

Hours of training delivered. This is an input, not an output. A program that delivers 40 hours of training and changes nothing is worse than a program that delivers 4 hours and changes how an entire team works. Hours measure cost, not value.

Number of people trained. Volume without capability is a procurement metric, not a learning metric. If you trained 5,000 people and 200 of them are now using AI in their daily work, you didn't train 5,000 people. You trained 200, and 4,800 watched a video.

Engagement analytics. Minutes watched, modules completed, badges earned. None of this tells you what the learner can now do that they couldn't do before.

If your AI training dashboard is built primarily on these metrics, you're not measuring ROI. You're measuring activity, and presenting it as outcome.

What ROI actually means for AI training

Return on investment requires two real numbers: the investment, and the return. The investment is easy: program cost, learner time, internal coordination overhead. The return is the hard part, and it's where most L&D teams give up and retreat to vanity metrics.

The return on AI training has to be measured at three levels, in this order.

1. Capability uplift (can they do the thing?)

Before training, can the learner perform the target task? After training, can they? This is not a survey question. It's a demonstration.

For a Copilot rollout, the test might be: given a real meeting transcript, produce a structured action register with owners and dates in under five minutes. For an AI engineering cohort working on Databricks, it might be: given a poorly performing pipeline, identify the bottleneck and propose a remediation. For a fraud and scam awareness program, it might be: given six communications including two deepfakes, correctly flag which are which and explain why.

This is what we mean by AI capability uplift. It's measurable, it's specific, and it's the floor of any ROI conversation. If you can't show that learners can now do something they couldn't do before, nothing else you measure matters.

2. Behaviour change (are they doing the thing?)

Capability is necessary but not sufficient. Plenty of people have the capability to use AI well and still don't, because the workflow doesn't reward it, the manager doesn't model it, or the tool sits behind three approval steps.

Behaviour metrics that actually matter:

  • Licence utilisation by team and individual (active use, not seats assigned)

  • Frequency of AI-assisted task completion in real workflows

  • Self-reported and manager-observed changes in how work gets done

  • Reduction in time-to-completion on target tasks

  • Volume of AI-generated artefacts that survive review and reach production

For a Microsoft Copilot deployment, the single most important behaviour metric is the gap between licences purchased and licences actively used. If you bought 10,000 seats and 1,800 people are using them weekly, your training didn't fail; your adoption design failed. But the training is implicated.
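
Measuring that gap is arithmetic, not analytics. Here's a minimal sketch, with invented seat assignments and usage events standing in for whatever your platform's admin reports actually export:

```python
from datetime import datetime, timedelta

# Invented inputs: seats assigned, plus usage events (user, timestamp) exported
# from the platform's admin reporting. Replace with your real exports.
assigned_seats = {"ana@corp.com", "ben@corp.com", "cho@corp.com", "dee@corp.com"}
usage_events = [
    ("ana@corp.com", datetime(2025, 5, 26)),
    ("ana@corp.com", datetime(2025, 5, 19)),
    ("ben@corp.com", datetime(2025, 3, 2)),   # assigned, but lapsed
]

def weekly_active_rate(seats, events, as_of, window_days=7):
    """Share of assigned seats with at least one usage event inside the window."""
    cutoff = as_of - timedelta(days=window_days)
    active = {user for user, ts in events if user in seats and ts >= cutoff}
    return len(active) / len(seats) if seats else 0.0

rate = weekly_active_rate(assigned_seats, usage_events, as_of=datetime(2025, 5, 28))
print(f"Weekly active use: {rate:.0%} of assigned seats")  # 25% in this toy data
```

The hard part is getting the export out of the platform owner, not the calculation.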

3. Business outcome (did it produce a result?)

This is what the CFO is actually asking about. Capability and behaviour change are leading indicators. The lagging indicator is the business result the training was commissioned to produce.

Tie every program to a specific outcome metric before it starts. Examples we've seen work:

  • Time saved per week on a defined high-volume task, multiplied by headcount

  • Reduction in handle time for customer service teams using AI assistance

  • Increase in pipelines, reports, or analyses delivered per analyst per quarter

  • Reduction in audit findings related to AI risk and governance gaps

  • Fraud incidents prevented or losses avoided after a scam awareness program

  • Time-to-pilot reduction for AI engineering teams

If you can't write down the business outcome metric before the training starts, you're going to struggle to prove ROI after it ends. This is the hardest part of the design conversation, and it's the one most L&D teams skip because the business sponsor hasn't told them what success looks like. Push back. If the sponsor can't articulate it, that's a conversation to have before any contract is signed.
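
To make the arithmetic concrete, here's a minimal sketch of the first outcome above: time saved per week on a defined task, multiplied by headcount, turned into a dollar return and an ROI figure. Every number is a hypothetical placeholder, not a benchmark.

```python
# All numbers are hypothetical placeholders; swap in your own measurements.
headcount = 250                 # people doing the target task
hours_saved_per_week = 1.5      # measured against a baseline, not self-reported
loaded_hourly_rate = 85.0       # fully loaded cost per hour
weeks_realised = 40             # weeks of benefit counted in the first year

program_cost = 180_000          # vendor fees, content, platform
learner_time_cost = headcount * 6 * loaded_hourly_rate   # 6 hours of training each
investment = program_cost + learner_time_cost

annual_return = headcount * hours_saved_per_week * loaded_hourly_rate * weeks_realised
roi = (annual_return - investment) / investment

print(f"Investment: ${investment:,.0f}")
print(f"Return:     ${annual_return:,.0f}")
print(f"ROI:        {roi:.0%}")   # ~315% with these placeholder numbers
```

The model is deliberately crude. Its credibility comes from where the inputs came from: time savings measured against a baseline, not guessed after the fact.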

How to actually measure it

Three practical things to put in place before the program starts.

Baseline before you train. Run the capability assessment before the program, not after. Most enterprises only test post-training and report the score as evidence of success, but without a baseline, the score is meaningless. A pre/post delta is the minimum viable measurement.
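
The delta itself is trivial to compute once both assessments exist. A minimal sketch, with invented learner IDs and scores:

```python
# Hypothetical capability scores (0-100) from the same demonstration task,
# run once before the program and once after. IDs and scores are invented.
pre_scores  = {"L001": 42, "L002": 55, "L003": 38, "L004": 61}
post_scores = {"L001": 74, "L002": 70, "L003": 52, "L004": 66}

deltas = {lid: post_scores[lid] - pre for lid, pre in pre_scores.items()
          if lid in post_scores}

mean_pre = sum(pre_scores.values()) / len(pre_scores)
mean_delta = sum(deltas.values()) / len(deltas)

print(f"Baseline mean: {mean_pre:.1f}")
print(f"Mean uplift:   {mean_delta:+.1f} points")
print(f"Learners who improved: {sum(d > 0 for d in deltas.values())}/{len(deltas)}")
```

The point isn't the code. It's that without the first set of numbers, the second set proves nothing.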

Instrument the workflow, not the LMS. The LMS will tell you who completed what. The workflow will tell you whether anything changed. Talk to your platform owners, your IT team, and your business unit leads about what telemetry already exists. Licence utilisation reports, ticket resolution times, code commit frequency, document creation rates: most of this data is already being captured somewhere. You don't need to build a new dashboard. You need to connect the existing ones to the training program.
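
As one simplified illustration of connecting existing dashboards rather than building new ones: join the LMS completion export to the platform utilisation export and compare active use among trained and untrained staff. The data, file shapes, and column names below are assumptions, not any vendor's actual schema.

```python
import csv, io

# Stand-ins for real exports; in practice these would be files from the LMS,
# the platform's admin portal, and HR. Column names are illustrative only.
lms_completions = "user_email\nana@corp.com\nben@corp.com\n"
tool_utilisation = "user_email\nana@corp.com\nben@corp.com\ncho@corp.com\n"
hr_headcount = "user_email\nana@corp.com\nben@corp.com\ncho@corp.com\ndee@corp.com\n"

def emails(raw_csv):
    return {row["user_email"] for row in csv.DictReader(io.StringIO(raw_csv))}

completed = emails(lms_completions)     # finished the training program
active    = emails(tool_utilisation)    # active on the tool in the last 30 days
staff     = emails(hr_headcount)

trained, untrained = completed, staff - completed
print(f"Active use, trained:   {len(trained & active) / len(trained):.0%}")
print(f"Active use, untrained: {len(untrained & active) / len(untrained):.0%}")
```

If the trained group isn't meaningfully more active than the untrained group, you have a behaviour problem that no completion rate will reveal.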

Sample, don't survey. Surveys at 90 days have a 12% response rate and tell you what people remember about how they feel. Pick a sample of 30 to 50 learners, watch them work, and ask them to walk you through how they're using AI on a real task. You'll learn more in three hours of observation than in three months of survey data.

The honest version of an ROI report

A defensible AI training ROI report has four sections: the business outcome it was commissioned to produce, the baseline and post-training capability scores, the behaviour data from the workflow, and the calculated dollar return based on time saved, errors avoided, or revenue enabled. If any of those sections is missing, the report is incomplete, and the reader should treat the conclusion with appropriate scepticism.

This is harder than running a satisfaction survey. It requires the L&D function to operate more like a product team than a course factory: define the outcome, instrument the experience, measure the change, iterate. Most L&D functions aren't set up for this yet. The ones that get there will be the ones whose AI programs survive the next budget cycle.

The CFOs are coming with sharper questions. The teams that can answer them with real numbers will keep getting funded. The teams that hand over a satisfaction score and a completion rate will find their AI training budget reallocated to something the business can actually measure.

If you want help designing programs that produce measurable capability and outcome change, that's what we build.

Ijan Kruizinga

Co-founder of Better People. 20+ years across technology and marketing leadership. Previously CEO of Crucial, CEO/COO of OMG and Jaywing.

Ready to talk?

30-minute discovery call.