The AI Engineering Mindset: What Changes When You Build with LLMs

AI engineering isn't just software engineering with a model attached. The feedback loops, failure modes, and quality signals are fundamentally different. Here's how to think about it.

Durval Pereira
6 min read

A different kind of engineering

Traditional software engineering is deterministic at its core. Given the same input, you expect the same output. Tests are binary — they pass or they fail. Deployments either work or they don't.

AI engineering breaks this assumption. When you integrate an LLM into a production system, you're working with a component that is:

  • Non-deterministic: the same prompt can produce different outputs
  • Expensive per call: orders of magnitude more costly than a database query
  • Latency-variable: response times range from milliseconds to tens of seconds
  • Opaque: you can't inspect the "reasoning" in a debuggable way
  • Evolving: model behavior changes with updates you don't control

This doesn't mean AI engineering is harder or easier than traditional engineering. It means the skills, patterns, and instincts are different.

The evaluation problem

In traditional software, you write a test:

expect(calculateTax(100, 'CA')).toBe(7.25)

In AI engineering, the equivalent is much harder:

const response = await llm.complete('Summarize this article...')
// What does "correct" mean here?
// How do you assert quality programmatically?

This is the central challenge. You need evaluation frameworks that are rigorous enough to catch regressions but flexible enough to accommodate the natural variation in model outputs.

The pattern that works best in practice is rubric-based evaluation: define a set of criteria, score outputs against those criteria, and track scores over time.

interface EvaluationRubric {
  criteria: EvaluationCriterion[]
  passingScore: number
}

interface EvaluationCriterion {
  name: string
  description: string
  weight: number
  scorer: (input: string, output: string) => Promise<number>
}

const summarizationRubric: EvaluationRubric = {
  criteria: [
    {
      name: 'completeness',
      description: 'Covers all key points from the source',
      weight: 0.3,
      scorer: async (input, output) => {
        // Extract key entities from input, check coverage in output
        const keyPoints = await extractKeyPoints(input)
        const covered = keyPoints.filter((p) =>
          output.toLowerCase().includes(p.toLowerCase())
        )
        return covered.length / keyPoints.length
      },
    },
    {
      name: 'conciseness',
      description: 'Significantly shorter than the original',
      weight: 0.2,
      scorer: async (input, output) => {
        const ratio = output.length / input.length
        if (ratio < 0.2) return 1.0
        if (ratio < 0.4) return 0.7
        return 0.3
      },
    },
    {
      name: 'accuracy',
      description: 'No hallucinated facts',
      weight: 0.5,
      scorer: async (input, output) => {
        // Use a separate model call to check factual consistency
        return await checkFactualConsistency(input, output)
      },
    },
  ],
  passingScore: 0.75,
}

This approach gives you a number you can track, alert on, and use in CI. It's not perfect — but it's far better than manual review or no evaluation at all.
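Aggregating the per-criterion scores into that single number is mechanical: weight each score and sum. A minimal sketch, with the interfaces repeated (slightly trimmed) so the snippet stands alone — `evaluateOutput` is an assumed name, not a library function:

```typescript
interface EvaluationCriterion {
  name: string
  weight: number
  scorer: (input: string, output: string) => Promise<number>
}

interface EvaluationRubric {
  criteria: EvaluationCriterion[]
  passingScore: number
}

// Run every criterion, weight the scores, and compare against the threshold.
async function evaluateOutput(
  rubric: EvaluationRubric,
  input: string,
  output: string
): Promise<{ score: number; passed: boolean; breakdown: Record<string, number> }> {
  const breakdown: Record<string, number> = {}
  let score = 0
  for (const criterion of rubric.criteria) {
    const s = await criterion.scorer(input, output)
    breakdown[criterion.name] = s
    score += s * criterion.weight
  }
  return { score, passed: score >= rubric.passingScore, breakdown }
}
```

The `breakdown` field matters as much as the total: when a score regresses, you want to know which criterion moved.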

Prompt engineering is API design

Treat your prompts as interfaces, not strings. A prompt is the API contract between your application logic and the model. It deserves the same rigor as any other API.

interface PromptTemplate<TInput, TOutput> {
  name: string
  version: string
  template: (input: TInput) => string
  parser: (raw: string) => TOutput
  examples: Array<{ input: TInput; expectedOutput: TOutput }>
}

const classifyIntent: PromptTemplate<
  { message: string },
  { intent: string; confidence: number }
> = {
  name: 'classify-intent',
  version: '2.1',
  template: ({ message }) => `Classify the user intent for the following message.

Respond with a JSON object containing "intent" and "confidence" (0-1).

Valid intents: question, feedback, complaint, request, other

Message: "${message}"

JSON response:`,
  parser: (raw) => {
    const parsed = JSON.parse(raw.trim())
    return {
      intent: parsed.intent,
      confidence: Math.min(1, Math.max(0, parsed.confidence)),
    }
  },
  examples: [
    {
      input: { message: 'How do I reset my password?' },
      expectedOutput: { intent: 'question', confidence: 0.95 },
    },
  ],
}

Versioning prompts, testing them against examples, and treating output parsing as a first-class concern are the practices that separate production AI systems from prototypes.
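The `examples` array is only useful if something runs it. A sketch of a regression harness, with the `PromptTemplate` interface repeated so the snippet stands alone — the model call is injected as `complete` so tests can stub it, and `matches` is supplied by the caller because exact equality is rarely the right comparison for model output:

```typescript
interface PromptTemplate<TInput, TOutput> {
  name: string
  version: string
  template: (input: TInput) => string
  parser: (raw: string) => TOutput
  examples: Array<{ input: TInput; expectedOutput: TOutput }>
}

// Run every example through template → model → parser and count matches.
async function runPromptExamples<TInput, TOutput>(
  prompt: PromptTemplate<TInput, TOutput>,
  complete: (text: string) => Promise<string>,
  matches: (actual: TOutput, expected: TOutput) => boolean
): Promise<{ passed: number; failed: number }> {
  let passed = 0
  let failed = 0
  for (const { input, expectedOutput } of prompt.examples) {
    const raw = await complete(prompt.template(input))
    const actual = prompt.parser(raw)
    if (matches(actual, expectedOutput)) passed++
    else failed++
  }
  return { passed, failed }
}
```

For `classifyIntent` above, `matches` might compare only the `intent` field and ignore `confidence`, which varies between runs.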

The cost equation

LLM calls are expensive. A system that makes a GPT-4-class call for every user interaction will have infrastructure costs that scale linearly with traffic — unlike traditional systems where costs scale sub-linearly with caching and optimization.

The key strategies:

Cache aggressively. If two users ask semantically similar questions, the answer is likely the same. Semantic caching — using embedding similarity to detect equivalent inputs — can reduce call volume by 30-60%.
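A minimal sketch of a semantic cache: embed each query, and treat two queries as equivalent when their cosine similarity clears a threshold. The `embed` function is an assumption (any embeddings endpoint works), the threshold needs tuning per use case, and a real implementation would use a vector index rather than a linear scan:

```typescript
type Embedding = number[]

function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

class SemanticCache {
  private entries: Array<{ embedding: Embedding; answer: string }> = []

  constructor(
    private embed: (text: string) => Promise<Embedding>,
    private threshold = 0.92 // assumption — tune against real traffic
  ) {}

  // Return a cached answer if any stored query is similar enough.
  async get(query: string): Promise<string | null> {
    const embedding = await this.embed(query)
    for (const entry of this.entries) {
      if (cosineSimilarity(embedding, entry.embedding) >= this.threshold) {
        return entry.answer
      }
    }
    return null
  }

  async set(query: string, answer: string): Promise<void> {
    this.entries.push({ embedding: await this.embed(query), answer })
  }
}
```

The threshold is the whole game: too low and users get answers to questions they didn't ask; too high and the cache never hits.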

Use the smallest model that works. GPT-4 is not always necessary. For classification, extraction, and formatting tasks, smaller models are faster, cheaper, and often just as accurate.
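In code, this often takes the shape of a router that maps task type to the cheapest adequate model. A trivial sketch — the model names and task taxonomy here are illustrative assumptions, not any provider's API:

```typescript
type TaskType = 'classification' | 'extraction' | 'formatting' | 'open-ended'

// Route narrow, well-specified tasks to a cheap model; reserve the
// expensive model for open-ended generation.
function pickModel(task: TaskType): string {
  switch (task) {
    case 'classification':
    case 'extraction':
    case 'formatting':
      return 'small-fast-model'
    case 'open-ended':
      return 'large-capable-model'
  }
}
```

Even a static mapping like this forces the team to ask, per feature, whether the big model is actually earning its cost.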

Batch when possible. Many LLM providers offer batch APIs at significant discounts. If your use case tolerates some latency, batch processing can cut costs by 50% or more.

Precompute where you can. If you're using an LLM to generate product descriptions, do it once at publish time — not on every page view.

Failure modes are different

Traditional systems fail with exceptions, timeouts, and error codes. AI systems have a more subtle failure mode: they return something that looks right but isn't.

This means you need guardrails:

async function safeLLMCall<T>(
  prompt: string,
  parser: (raw: string) => T,
  validator: (result: T) => boolean,
  retries = 2
): Promise<T | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const raw = await llm.complete(prompt)
      const parsed = parser(raw)

      if (validator(parsed)) {
        return parsed
      }

      // Output parsed but failed validation — retry
      continue
    } catch {
      // Parse failure — retry
      continue
    }
  }

  return null // All attempts failed — fall back gracefully
}

The validator function is the key. It encodes your business logic about what a valid output looks like. Without it, you're trusting the model completely — and that trust should always be verified.
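Putting the pieces together, here is a sketch of the intent classifier with a validator and a deterministic non-AI fallback. The model call is injected as `complete` so the function is testable; the prompt, intent list, and keyword heuristic are illustrative assumptions:

```typescript
const VALID_INTENTS = ['question', 'feedback', 'complaint', 'request', 'other']

interface IntentResult {
  intent: string
  confidence: number
}

function parseIntent(raw: string): IntentResult {
  const parsed = JSON.parse(raw.trim())
  return { intent: parsed.intent, confidence: parsed.confidence }
}

// Business logic for "valid output": a known intent and a sane confidence.
function isValidIntent(result: IntentResult): boolean {
  return (
    VALID_INTENTS.includes(result.intent) &&
    result.confidence >= 0 &&
    result.confidence <= 1
  )
}

// Non-AI fallback: crude keyword rules, but deterministic and free.
function keywordFallback(message: string): IntentResult {
  if (message.includes('?')) return { intent: 'question', confidence: 0.5 }
  return { intent: 'other', confidence: 0.3 }
}

async function classify(
  message: string,
  complete: (prompt: string) => Promise<string>
): Promise<IntentResult> {
  try {
    const candidate = parseIntent(
      await complete(`Classify the intent of: "${message}"`)
    )
    if (isValidIntent(candidate)) return candidate
  } catch {
    // Parse failure — fall through to the deterministic fallback
  }
  return keywordFallback(message)
}
```

The fallback is deliberately dumb. Its job is not to match the model's quality but to keep the feature functional when the model misbehaves.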

What this means for your team

If you're integrating AI into a production system, your team needs to develop new muscles:

  1. Comfort with probabilistic outputs. Not every response will be perfect. Define "good enough" explicitly.
  2. Evaluation-driven development. Write evals before writing prompts, just as you'd write tests before code.
  3. Cost awareness. Every engineer should understand the cost per call and the total cost per feature.
  4. Graceful degradation. Every AI-powered feature should have a non-AI fallback.

AI engineering is real engineering. It just requires a different set of instincts.


Next in this series: building a production RAG system that actually works.

Tags: ai, llm, engineering-practices, production