The Decision Matrix: A New Framework for AI Evaluation
Why intelligent systems fail when we start with problems instead of choices
The pitch deck arrives via email. Fifteen slides of gradient backgrounds, customer logos you recognize, and a demo video showing an AI that “transforms how teams work.” The founder talks about “unlocking productivity” and “AI-powered insights.” The features sound impressive. The problem statement feels familiar. And yet, six months after signing the contract, the tool sits unused—another line item in the software graveyard, another executive sponsor quietly reassigned.
This pattern repeats across industries with numbing consistency. Not because the models are weak. Not because the founders lack ambition. But because we’re still building intelligent systems the way we built SaaS platforms—starting with problems, use cases, and user needs. That approach worked when software automated human tasks. It breaks when software makes decisions.
The shift from SaaS to AI isn’t just technological. It’s structural. And most organizations haven’t made it yet.
What SaaS Taught Us to Ignore
The SaaS playbook ran on a simple logic: identify a workflow, make it faster, charge based on seats or usage. The human remained the decision-maker. The software was a supporting actor. When Salesforce helped you track a lead, you still decided whether to call them. When Slack delivered a message, you still chose how to respond. The system automated execution, not judgment.
That model broke down cleanly into use cases and user stories. “As a sales rep, I want to see all my open deals so that I can prioritize my week.” The problem was clear. The solution followed. And when the software made a mistake—a notification fired at the wrong time, a report miscalculated—the human caught it, corrected it, moved on.
AI doesn’t work that way.
AI doesn’t just speed up a process. It encodes judgment. It doesn’t assist the moment of choice; it replaces it. And when it’s wrong, the consequences don’t stop at an annoyed user. They cascade into contracts signed under false assumptions, loans approved with hidden risk, operational decisions that compound across quarters.
The cost of error isn’t a bug report. It’s a business exposure.
Starting with a “use case” in AI is like building a house without checking the foundation. It focuses attention on what the system does, not what it decides. And if you can’t name the decision, you can’t evaluate the risk.
The Decision-First Principle
Three questions separate signal from noise faster than any technical audit:
What decision does this improve or replace?
Not “helps teams collaborate.” Not “optimizes workflows.” What specific choice does the AI influence or make? Be ruthlessly concrete. “Should we approve this loan application?” “Does this contract clause require legal review?” “Should we engage this deal more aggressively?” If the answer sounds like a capability rather than a choice, you’re looking at vaporware.
Who owns that decision today?
If no one owns it now, AI won’t fix that. Ownership means accountability plus consequences. It means someone has skin in the game when the decision goes wrong. Gong succeeded because sales leaders already owned pipeline risk decisions. It didn’t invent a new behavior. It sharpened an existing one. Products that catch on almost always do.
What happens when the AI is wrong?
Not “it learns and improves.” What happens right now, in the real environment where this operates? What’s the cost to the business? The regulatory exposure? The erosion of customer trust? If the vendor hasn’t thought through failure modes with the same rigor they applied to the demo, they’re not building a product. They’re building a science project.
These three questions force a reckoning most vendors aren’t prepared for. They reveal whether the AI is grounded in reality or floating in abstraction. They surface the difference between a system designed to work and one designed to sell.
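For teams that want to force that reckoning in writing, the three questions reduce to a screening record with three fields. What follows is a minimal sketch, purely illustrative; the field names, the concreteness heuristic, and the lending entry are assumptions for demonstration, not a prescribed template.

```python
from dataclasses import dataclass

@dataclass
class DecisionScreen:
    """One record per AI product under evaluation."""
    decision: str             # the specific choice the AI influences or makes
    owner: str                # who is accountable for that decision today
    failure_consequence: str  # what happens, right now, when the AI is wrong

    def looks_concrete(self) -> bool:
        # Crude heuristic: a real decision reads as a question with a bounded answer.
        # Capability statements ("helps teams collaborate") fail this check.
        return self.decision.strip().endswith("?")

# Illustrative entry for the lending example above (owner is hypothetical)
screen = DecisionScreen(
    decision="Should we approve this loan application?",
    owner="head of credit risk",
    failure_consequence="loan approved with hidden risk; regulatory exposure",
)
assert screen.looks_concrete()
```

If a vendor cannot fill in all three fields in one sitting, the rest of the evaluation is premature.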
But identifying the decision is only half the shift. The other half—the part almost no one discusses—is understanding what kind of intelligence that decision actually requires.
The Intelligence Mismatch
In the SaaS era, we had one type of problem and one type of solution: automate the workflow, deliver the feature. The intelligence lived entirely in the human. The software was infrastructure.
In the AI era, three distinct types of intelligence operate simultaneously inside every organization:
Human intelligence—context, judgment, ethics, intuition. The capacity to navigate ambiguity, weigh competing values, override systems when circumstances demand it. Slow, expensive, irreplaceable.
Statistical and computational intelligence—optimization, pattern recognition, prediction, constraint solving. Fast, scalable, and brittle outside its domain. Excels when logic is agreed upon, inputs are structured, and error costs can be bounded.
Generative artificial intelligence—reasoning, synthesis, adaptation across domains. Powerful for exploration, sensemaking, pattern surfacing. Dangerous when treated as authoritative in high-consequence decisions.
Each form of intelligence has strengths. Each has failure modes. And critically, each is suited to different types of decisions.
The fatal mistake most organizations make is treating all decisions as equivalent and throwing the same type of intelligence at every problem. They deploy LLMs where algorithmic optimization would suffice. They ask humans to review outputs that computational intelligence should handle deterministically. They build custom AI where a well-tuned heuristic would outperform at a fraction of the cost.
This isn’t just inefficient. It’s structurally unstable.
The Method: Diagnosing Decision-Intelligence Fit
Here’s what the SaaS playbook never required: a systematic way to diagnose whether a decision structure can actually support the intelligence type you’re about to deploy.
Most evaluations still stop at “does the demo look good?” The method that matters runs deeper—four steps that expose misalignment before it becomes expensive.
Step 1: Characterize the decision
Every decision sits somewhere on two dimensions that determine what kind of intelligence can safely scale it.
First, how contested is the logic? Do stakeholders agree on what “good” looks like, or is success defined differently across functions? If sales measures outcomes by customer satisfaction and finance measures by margin protection, any AI that optimizes for one will be overridden by the other.
Second, how costly is failure? Can errors be caught and corrected quickly, or do they compound into regulatory exposure, reputational damage, or irreversible customer loss?
These aren’t abstract questions. They have concrete implications.
Fraud detection: Logic is debated at the margins but broadly standardized. Merchants agree on what fraud looks like even if thresholds vary. Failure cost is high but bounded—false positives erode trust, false negatives cost money and regulatory standing.
Strategic capital allocation: Logic is deeply contested. Every leadership team weighs risk, competitive position, and institutional memory differently. Failure cost is catastrophic and difficult to detect early.
The decision structure determines what comes next.
Step 2: Match intelligence type to decision structure
Once you’ve characterized the decision, the intelligence match becomes clear.
Standardized logic + bounded error + authoritative data = statistical and computational intelligence, embedded in workflow, with clear escalation paths.
This is Stripe Radar territory. Transactional data is real-time and proprietary. The encoded logic aligns with merchant consensus. The system sits directly in the payment flow, triggering immediate action. Edge cases escalate to human review. The intelligence type fits the decision structure.
Contested logic + contextual judgment + incomplete data = human intelligence remains primary, with AI used selectively for scenario modeling and pattern surfacing.
This is strategic planning territory. The AI provides inputs—market signal analysis, probabilistic forecasting, anomaly detection—but the final decision stays in human hands because the judgment required is irreducibly contextual. Attempts to automate this decision don’t fail because models are weak. They fail because stakeholders haven’t agreed on decision criteria, and no model performance can resolve that.
Exploratory synthesis + tolerance for revision + low immediate consequence = generative AI for sensemaking and option generation.
This is the territory of early-stage research, content ideation, and hypothesis formation. The system helps humans think differently, not make final calls. Outputs require validation. The workflow expects human oversight.
The pattern is consistent: when the match breaks, systems stall; when it holds, adoption accelerates.
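Read literally, the mapping in this step is almost a lookup table. Here is a minimal sketch of it, assuming the two Step 1 dimensions plus a data-authority flag, each collapsed to a yes/no judgment; the names and the three-way split are illustrative simplifications of the patterns above, not a calibrated rubric.

```python
from dataclasses import dataclass
from enum import Enum

class Intelligence(Enum):
    STATISTICAL = "statistical/computational, embedded in workflow, with escalation paths"
    HUMAN_PRIMARY = "human judgment primary; AI for scenario modeling and pattern surfacing"
    GENERATIVE = "generative AI for sensemaking and option generation, with human validation"

@dataclass
class DecisionProfile:
    contested_logic: bool     # do stakeholders disagree on what "good" looks like?
    failure_compounds: bool   # do errors cascade, or can they be caught and corrected quickly?
    authoritative_data: bool  # does the system see the real source of truth, not a proxy?

def match_intelligence(p: DecisionProfile) -> Intelligence:
    """Simplified rendering of the Step 2 mapping; real evaluations need more nuance."""
    if not p.contested_logic and not p.failure_compounds and p.authoritative_data:
        return Intelligence.STATISTICAL      # Stripe Radar territory
    if p.contested_logic or p.failure_compounds:
        return Intelligence.HUMAN_PRIMARY    # strategic planning territory
    return Intelligence.GENERATIVE           # exploration and ideation territory

# The two examples from Step 1, characterized on the same axes
fraud_detection = DecisionProfile(contested_logic=False, failure_compounds=False, authoritative_data=True)
capital_allocation = DecisionProfile(contested_logic=True, failure_compounds=True, authoritative_data=False)

assert match_intelligence(fraud_detection) is Intelligence.STATISTICAL
assert match_intelligence(capital_allocation) is Intelligence.HUMAN_PRIMARY
```

Collapsing the dimensions to booleans loses nuance on purpose; the point is that the mapping is a structural judgment, not a model benchmark. The environment check in Step 3 supplies the nuance.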
Step 3: Audit the system dependencies
Even when intelligence type aligns with decision structure, deployment still fails if the surrounding system can’t support what the AI actually needs.
Five dependencies determine whether intelligence can land:
Data authority—Is the data the system relies on actually authoritative, or is it a proxy? Where does truth live today, and does the AI have access to it? If the system depends on stale, incomplete, or contested data, the decision will be structurally unreliable regardless of model sophistication.
Integration into action—Does this live where work happens, or is it bolted on? Does it trigger real execution, or generate insights that go nowhere? UiPath didn’t win by offering insights. It won by executing work inside existing systems. Insight without action becomes shelfware.
Enablement cost versus value—What must change in daily workflows for this to deliver value? If adoption depends on extensive data cleanup, new infrastructure, retraining entire teams, or redesigning incentives, the vendor is underpricing implementation risk. You inherit that burden.
Decision logic alignment—Do stakeholders actually agree on the encoded logic, or will competing definitions of success lead to constant overrides? A logistics AI that optimizes delivery cost over customer satisfaction will be fought by any sales team compensated on satisfaction scores. That’s not an AI problem. It’s a leadership problem the AI exposed.
Decision legitimacy—Can the organization let this judgment scale, or will cultural resistance, threatened roles, and power dynamics create friction? Can the decision be explained, audited, and defended under regulatory or legal scrutiny? If the answer is no, the system won’t survive first contact with reality.
These aren’t edge cases. They’re where most AI products silently stall—not because the technology failed, but because the environment was never ready.
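Each of the five dependencies can be audited before a pilot rather than discovered after one. Below is a minimal sketch of such an audit, assuming blunt yes/no judgments per dependency; in practice each answer deserves evidence rather than a checkbox, and the field names are illustrative, not a standard.

```python
from dataclasses import dataclass, fields

@dataclass
class DependencyAudit:
    data_authority: bool              # the system reads the real source of truth, not a proxy
    integrated_into_action: bool      # it lives where work happens and triggers execution
    enablement_cost_acceptable: bool  # adoption doesn't hinge on cleanup or retraining you can't fund
    decision_logic_agreed: bool       # stakeholders share one definition of success
    decision_legitimacy: bool         # the judgment can be scaled, explained, audited, defended

def readiness_gaps(audit: DependencyAudit) -> list[str]:
    """Return the dependencies the environment cannot yet support."""
    return [f.name for f in fields(audit) if not getattr(audit, f.name)]

# Strong data and integration, but competing definitions of success
print(readiness_gaps(DependencyAudit(True, True, True, False, True)))
# ['decision_logic_agreed'] -> the leadership problem the AI will expose
```

A single gap is not disqualifying, but it is a named cost someone must own before deployment, not after.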
Step 4: Pressure-test against failure modes
The final step is the one SaaS never required: explicitly mapping what happens when the intelligence gets it wrong.
For each decision the AI touches, trace the failure path:
What breaks immediately?
What compounds over time?
Who bears the cost—customer, employee, business, regulator?
Can the damage be contained, or does it cascade?
Is there a human in position to catch it before consequence?
If the vendor hasn’t thought through failure modes with the same depth they applied to the demo, walk away. You’re not evaluating a product. You’re funding a bet that nothing will go wrong.
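The trace reads naturally as one record per decision. Here is a minimal sketch, assuming free-text answers for the first three questions; the lending example and the acceptance rule are illustrative assumptions, not a definitive gate.

```python
from dataclasses import dataclass

@dataclass
class FailurePath:
    decision: str
    breaks_immediately: str
    compounds_over_time: str
    cost_bearer: str                 # customer, employee, business, or regulator
    damage_containable: bool         # can the damage be contained, or does it cascade?
    human_positioned_to_catch: bool  # is someone in place to catch it before consequence?

    def tolerable(self) -> bool:
        # Crude gate: an uncontainable failure with no human catch point is disqualifying.
        return self.damage_containable or self.human_positioned_to_catch

lending = FailurePath(
    decision="Should we approve this loan application?",
    breaks_immediately="a loan is approved under false assumptions",
    compounds_over_time="hidden portfolio risk and regulatory exposure",
    cost_bearer="business and regulator",
    damage_containable=False,
    human_positioned_to_catch=True,
)
assert lending.tolerable()
```

Writing the trace down is the point: a failure path that cannot be articulated cannot be contained.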
The Matrix No One Draws
Put the four steps together and you have something the market hasn’t built yet: a defensible method for evaluating whether an AI system is actually ready to deploy.
Not “does the model work?” but “does the decision structure support the intelligence type, and does the environment provide what that intelligence requires to operate safely?”
Stripe Radar passes this test cleanly. The decision—approve or block transaction—is concrete and owned. The logic is standardized. The intelligence type—statistical pattern recognition on real-time transactional data—matches the decision structure. The system integrates directly into payment flow. Failure modes are bounded and escalate appropriately. Decision legitimacy is high because merchants already trust this judgment.
Generic “AI strategy copilots” fail this test immediately. The decision is vague or non-existent. The logic is contested across stakeholders. The intelligence type—generative synthesis—is mismatched to high-stakes allocation choices. The system bolts onto existing planning processes without changing how work gets done. Failure modes are invisible until quarters later. Decision legitimacy collapses the first time the AI recommends something leadership disagrees with.
The difference isn’t model quality. It’s structural fit.
The Discipline the Market Hasn’t Learned
We’re early enough that most failures still get blamed on “change management” or “adoption challenges.” That language obscures the real problem.
The issue isn’t user resistance. It’s that the system was misaligned from the start—built around a vague problem instead of a concrete decision, matched to the wrong intelligence type, deployed into an environment that couldn’t support what it needed to work.
The discipline required now is harder than what SaaS demanded. It requires executives to name decisions they’d prefer to keep ambiguous. It requires product leaders to refuse features that don’t map to clear choices. It requires investors to walk away from impressive demos that can’t answer basic questions about decision ownership, intelligence-type fit, and failure containment.
It requires starting not with what AI can do, but with what decision it’s meant to improve—and whether that decision structure is even ready for intelligence to scale it.
Most aren’t.
And that’s the work.