Before the Model, There Was the System
What the first AI system I built taught me about trust, constraint, and why scale isn’t intelligence
This week I want to tell you a story about the very first AI use case I ever worked on.
The year was 2017. And yes, before anyone asks, there were people working seriously in AI back then. Some of them were women.
I’m starting here not out of nostalgia, but because this use case has stayed with me. It has become more relevant, not less, as the conversation around AI has narrowed and accelerated. When people talk about AI today, they’re almost always talking about large language models—chat, content, general-purpose systems trained on massive, undifferentiated corpora of human output.
LLMs are one manifestation of this technology. They are not the entirety of it. And they are certainly not the only way to build useful, responsible, or economically meaningful systems.
The first AI system I worked on had a much more modest ambition: to help organizations identify bias in employee performance reviews.
Not to correct them automatically. Not to override managers. Not to make decisions on behalf of anyone. The goal was simply to surface patterns that were difficult for humans to see consistently, especially at scale—gendered language, tone discrepancies, situations where two reviews carried similar sentiment but resulted in different performance ratings.
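To make that last pattern concrete: here is a minimal sketch, in modern Python with hypothetical names, of what a similar-sentiment, different-rating check can look like. It is an illustration of the idea, not the system we actually shipped.

```python
from dataclasses import dataclass

@dataclass
class Review:
    employee_id: str
    sentiment: float  # 0.0 (negative) to 1.0 (positive), from any sentiment scorer
    rating: int       # final performance rating, e.g. 1 to 5

def flag_rating_discrepancies(reviews, sentiment_tolerance=0.1, rating_gap=2):
    """Surface pairs of reviews whose written sentiment is nearly identical
    but whose final ratings diverge sharply. Flagged pairs go to a human;
    nothing is corrected automatically."""
    flagged = []
    for i, a in enumerate(reviews):
        for b in reviews[i + 1:]:
            similar_language = abs(a.sentiment - b.sentiment) <= sentiment_tolerance
            divergent_outcome = abs(a.rating - b.rating) >= rating_gap
            if similar_language and divergent_outcome:
                flagged.append((a.employee_id, b.employee_id))
    return flagged
```

The output is deliberately boring: a list of pairs someone should look at. That is the whole job.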
The logic was straightforward. Performance reviews sit upstream of everything that matters inside an organization. They shape pay, promotion, potential, and influence long before anyone looks at compensation data. If bias is embedded in the language used to describe people’s work, then equity efforts that begin at pay are already too late.
What mattered most in that system was not the model itself. It was how we chose to build it.
What we built first was not a model. It was a definition of reality.
What we were really doing—though we didn’t use this language at the time—was defining a clear ontology before we ever trained a model. We were explicit about what existed in the system and what did not. What counted as signal versus noise. What distinctions mattered, and which ones were intentionally out of scope. Bias was not a vague concept we hoped the system would intuit from the internet. It was something we defined, grounded in expert judgment, and constrained to a specific organizational context.
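If "ontology" sounds abstract, it can be as plain as an enumerated label set. The categories below are illustrative, not our actual taxonomy, but this is roughly what it means to write down which distinctions exist for the system and which are deliberately out of scope:

```python
from enum import Enum

class BiasSignal(Enum):
    """The only labels annotators could assign; nothing outside this set exists for the model."""
    GENDERED_LANGUAGE = "gendered_language"          # e.g. "abrasive" or "emotional" applied unevenly
    DIMINISHMENT = "diminishment"                    # accomplishments framed as help, support, or luck
    INCONSISTENT_STANDARD = "inconsistent_standard"  # same behavior, different framing across people
    NO_SIGNAL = "no_signal"

# Explicitly out of scope, by design: the system never attempted any of these.
OUT_OF_SCOPE = {"manager_intent", "demographic_inference", "automated_rewriting"}
```

Writing the boundaries down is the point. Everything the model later learns is constrained to that vocabulary.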
We did not train the system on the entire internet. We did not ingest every piece of human-generated text we could get our hands on. And we certainly did not assume that more data would somehow neutralize bias.
The opposite was true. The broader the corpus, the more bias you inherit. The internet is not a clean dataset. It’s a record of power, exclusion, shorthand, and cultural drift. Training a system intended to detect bias on a dataset saturated with it would have undermined the entire effort.
So we took a different approach.
We hired domain experts—people with deep backgrounds in linguistics, language, and HR. People who understood not just how language functions grammatically, but how it operates inside organizational hierarchies and decision-making structures. We asked them to work directly with real performance reviews, annotating phrases that signaled bias, diminishment, or inconsistency. We asked them to apply judgment, not rules.
That expert-labeled dataset became the foundation of the system.
From a technical standpoint, there was nothing exotic about the model; the discipline around it was what set it apart. This was classic natural language processing and statistical pattern recognition—sentence-level analysis, phrase weighting, context-aware scoring. The system learned patterns from expert judgment and applied them to new reviews. It flagged language. It highlighted discrepancies. It made its reasoning legible.
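A deliberately simplified sketch of the phrase-weighting idea, assuming a flat bag of phrases rather than the sentence-level, context-aware features the real system used, looks something like this:

```python
from collections import Counter

def learn_phrase_weights(annotated_reviews):
    """Estimate how strongly each phrase is associated with an expert bias flag.
    annotated_reviews: iterable of (phrases, expert_flagged) pairs."""
    flagged, total = Counter(), Counter()
    for phrases, expert_flagged in annotated_reviews:
        for phrase in phrases:
            total[phrase] += 1
            if expert_flagged:
                flagged[phrase] += 1
    # Relative frequency of expert flags; crude, but every weight traces back to a human judgment.
    return {p: flagged[p] / total[p] for p in total}

def score_review(phrases, weights, threshold=0.6):
    """Return flagged phrases with their weights attached, so a reviewer can see
    exactly why each phrase was surfaced."""
    return [(p, round(weights[p], 2)) for p in phrases if weights.get(p, 0.0) >= threshold]
```

Nothing here is clever. The value is that every flag a manager sees is traceable back to the expert annotations it was learned from.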
Crucially, it did not pretend to be objective. It did not hide behind scale or abstraction. It was explainable by design because its ontology was explicit. I could tell you who labeled the data. I could tell you why a phrase was flagged. I could walk a skeptical executive through exactly what the system was doing and, just as importantly, what it was not doing. The model augmented human judgment—it did not replace it. That distinction made legal, HR, and executive adoption possible.
That mattered more than we realized at the time.
The system worked. Not because it was large, but because it was disciplined. Not because it was novel, but because it was trusted. Organizations removed biased language from a meaningful share of performance reviews. Promotion and potential signals became more consistent. Economic outcomes improved in measurable ways.
Trust was not an abstract principle in that system. It was structural.
Trust wasn’t a belief. It was engineered.
Looking back now, what strikes me is not how primitive that work was, but how much of it we seem to have forgotten.
Somewhere along the way, we collapsed intelligence into scale. We started equating model size with sophistication and constraint with weakness. Ontology gave way to probability. When ontology is implicit, responsibility becomes diffuse. Instead of defining the structure of the domain up front, we began asking models trained on everything to reason about anything. Fluency became a proxy for understanding. Confidence substituted for constraint.
Today’s AI discourse is dominated by benchmarks, leaderboards, and weekly shifts in who appears to be “winning.” The conversation oscillates between awe and anxiety, but rarely slows down long enough to ask what kind of systems we are actually building.
The most effective AI system I’ve ever worked on was small, tightly scoped, and deeply constrained. It was built with respect for the domain, the data, and the humans involved. It was designed to earn trust, not demand it.
I’m not anti-LLM. I’m not arguing for a return to some simpler past. I’m arguing for memory.
Fluency is not judgment. Probability is not responsibility.
AI is not magic. It’s pattern recognition, probability, and systems design. And systems succeed or fail not based on how impressive they look in a demo, but on whether people trust them enough to integrate them into real decision-making.
If we’re serious about transformation in the coming years, it won’t come from ever-larger models alone. It will come from remembering how to build systems worthy of the responsibility we’re handing them—systems that are constrained where it matters, transparent where it counts, and grounded in human expertise rather than abstraction.
That’s not a step backward.
It’s the work.
📣 Personal Announcement - I’m trying something new this year: teaching my work on Maven.
Over time, I’ve gravitated more and more toward the part of advisory work I’ve always valued most—teaching. Not teaching as in slides or lectures, but teaching as in sitting with product, engineering, and executive teams and helping them see what actually matters. Teaching judgment. Teaching craft. Teaching how to move forward when the data is imperfect, incentives collide, and the right answer is not obvious.
As the pace of change in AI accelerates, it’s become clear that the traditional ways of transferring this kind of practitioner knowledge are not keeping up. So I’m experimenting with more direct ways to share the methods and mental models I’ve used inside real organizations—without layers, without abstraction, and without diluting the work.
One expression of that experiment is a free, live 30-minute Lightning Lesson I’m hosting on Maven in January. You could call it my personal attempt to help us all have better meetings in 2026.
This session is designed for senior leaders who regularly operate inside high-stakes executive and product forums—steering committees, product reviews, and decision meetings where ambiguity is high and consequences are real. We’ll focus on how to create alignment faster, surface the right risks earlier, and drive decisions that actually hold once the meeting ends.
The Lightning Lesson is a working preview of how I teach—grounded in real enterprise scenarios, focused on judgment and decision quality, and immediately applicable to the meetings and initiatives you and your team members are already leading.
This is a reader-supported publication. All of my essays are currently free and will remain so for the foreseeable future; if you’d like to support my writing, please consider becoming a paid subscriber.


