
Chapter 2.3 – Model Training vs Execution

Understanding the difference between training and inference, mapped to build time vs runtime concepts from automation.


Model Training vs Inference: Build Time vs Runtime

In automation, I’m used to thinking about “build time” vs “runtime”:

The Two Phases of Machine Learning

  • Build time: When I write, test, and deploy Terraform configs, Ansible playbooks, or CI/CD pipelines
  • Runtime: When those configs actually execute and provision infrastructure

ML has a similar split, but the terminology threw me off initially:

  • Training: When the model learns patterns from data (build time)
  • Inference: When the trained model makes predictions on new data (runtime)

Engineer Insight: I kept reading about ML projects that “worked in training but failed in production.” That confused me until I realized: these are two completely different activities with different inputs, outputs, and resource requirements.

Once I mapped training/inference to build/runtime, it clicked.

flowchart LR
    subgraph Training["🔨 TRAINING PHASE (Build Time)"]
        direction TB
        T1["Historical Data<br/>(Features + Labels)"]
        T2["Learn Patterns"]
        T3["Trained Model<br/>(Artifact)"]
        T1 --> T2 --> T3
    end
    
    subgraph Inference["⚡ INFERENCE PHASE (Runtime)"]
        direction TB
        I1["New Data<br/>(Features Only)"]
        I2["Apply Patterns"]
        I3["Prediction<br/>(Label)"]
        I1 --> I2 --> I3
    end
    
    Training --> |"Deploy Model"| Inference
    
    style Training fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style Inference fill:#ffe1f5,stroke:#cc0066,stroke-width:2px

1. Training: The Build Phase

Model Training: How Machines Learn

Model training is the process where an ML model learns patterns from historical data by adjusting internal parameters to reduce prediction errors. This is the build phase—like writing and refining automation until it behaves correctly across environments.

flowchart TD
        A[Historical Data<br/>Features + Labels] --> B[Model Training Loop]
        B --> C[Trained Model Artifact]
        C --> D[Predictions on New Data]

Automation Analogy: Training a model is like iteratively improving your Terraform config. Your first draft is rough, but with feedback and testing, you refine it until it works for all environments.

Example: Iterative Automation (Terraform)
# First draft (untrained model)
resource "aws_instance" "web" {
  instance_type = "t2.micro"  # Too small for prod
}

# After feedback (trained model)
resource "aws_instance" "web" {
  instance_type = var.environment == "prod" ? "t3.large" : "t3.small"
  root_block_device {
    volume_size = var.environment == "prod" ? 100 : 20
  }
}

How Training Works: The Learning Loop

Now that we know what training is, let’s see how it actually happens:

flowchart TD
                Start(["Start Training"])
                Load["Load Training Data<br/>Features + Labels"]
                Init["Initialize Model<br/>Random Weights"]
                Predict["Make Predictions<br/>on Training Data"]
                Compare["Compare to Actual<br/>Calculate Error"]
                Adjust["Adjust Weights<br/>Reduce Error"]
                Check{"Error Low<br/>Enough?"}
                Save["Save Trained Model<br/>model.pkl"]
                End(["Model Ready"])
                Start --> Load --> Init --> Predict --> Compare --> Adjust --> Check
                Check -->|No| Predict
                Check -->|Yes| Save --> End
                style Start fill:#90EE90,stroke:#006400
                style End fill:#90EE90,stroke:#006400
                style Save fill:#FFD700,stroke:#B8860B
                style Check fill:#87CEEB,stroke:#4682B4

Key Insight: Training is a feedback loop — predict → measure error → improve → repeat.
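That feedback loop can be sketched in a few lines. This is a toy illustration only — a one-weight linear model fit with gradient descent on synthetic data (the numbers and learning rate are made up for the example):

```python
import numpy as np

# Synthetic data: one feature x, label y = 3x plus noise (true weight = 3.0).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + rng.normal(0, 0.5, 100)

w = 0.0        # initialize the model weight
lr = 0.01      # hyperparameter: learning rate

for epoch in range(200):
    pred = w * x                      # 1. make predictions on training data
    error = pred - y                  # 2. compare to actual values
    grad = 2 * np.mean(error * x)     #    gradient of the mean squared error
    w -= lr * grad                    # 3. adjust the weight to reduce error
    if np.mean(error ** 2) < 0.5:     # 4. stop once the error is low enough
        break

print(round(w, 2))   # ends close to the true weight, 3.0
```

The learned weight converges toward 3.0 — predict, measure, improve, repeat, exactly as in the flowchart.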

What Goes Into Training (Inputs)

  • Training data: Features + labels (historical examples)
  • Validation data: Held-out data to check generalization
  • Hyperparameters: Learning controls (learning rate, depth, layers)
  • Model choice: Algorithm family (tree, regression, neural net)

Automation Analogy:

  • Data = input variables
  • Hyperparameters = pipeline knobs
  • Algorithm = tool choice (Terraform vs Ansible vs Helm)

What Comes Out of Training (Outputs)

  • Trained model artifact: (model.pkl, model.h5)
  • Training metrics: (loss, accuracy, validation score)
  • Metadata: (version, dataset, hyperparameters)

Automation Analogy: Like a compiled Terraform plan + state + commit metadata.
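The training outputs above can be saved as an artifact plus metadata, so the model is versioned like any other build artifact. A minimal sketch — the model object, file names, and metadata values here are illustrative stand-ins:

```python
import json
import os
import pickle
import tempfile

# Stand-in for a real trained model object; any picklable object works here.
model = {"rule": "files_changed > 100 and env == 'prod' -> High"}
metadata = {
    "version": "v1",
    "dataset": "deployments_2024.csv",      # hypothetical dataset name
    "hyperparameters": {"max_depth": 8},
    "metrics": {"accuracy": 0.89},
}

outdir = tempfile.mkdtemp()
with open(os.path.join(outdir, "model.pkl"), "wb") as f:
    pickle.dump(model, f)                   # the artifact
with open(os.path.join(outdir, "model.json"), "w") as f:
    json.dump(metadata, f, indent=2)        # the metadata, versioned alongside it

print(sorted(os.listdir(outdir)))           # → ['model.json', 'model.pkl']
```

Shipping the metadata next to the artifact is what makes later questions like "which dataset and hyperparameters produced this model?" answerable.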


Hyperparameter Tuning & Model Selection (How Models Improve)

Now the question becomes: How do we make the model better?

Two options we have:

  1. Change the model type (Decision Tree → Random Forest → Neural Net)

  2. Tune hyperparameters (depth, learning rate, number of trees)

Key insight: You don’t tune weights directly — training does that. You tune the learning process itself using hyperparameters.

If you skip tuning and selection, you don’t get a “bad” model — you get a model that looked good during training but fails in production.
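As a concrete sketch, here is what tuning one hyperparameter (tree depth) against held-out data can look like. This assumes scikit-learn is available; the dataset is synthetic and the depth candidates are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset, held out 70/30 for training vs evaluation.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

best_depth, best_score = None, 0.0
for depth in [2, 4, 8, 16]:                    # hyperparameter candidates
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)                # training tunes the weights...
    score = model.score(X_val, y_val)          # ...held-out data tunes the depth
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, round(best_score, 2))
```

Note the division of labor: `fit` adjusts the learned parameters; the loop adjusts the learning process.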


What Goes Wrong Without Validation (The False Confidence Trap)

Most teams start like this:

All data → Train model → Evaluate on same data → Accuracy = 96% → Ship it

This works in demos. It fails in reality.

Failure mode: The model memorizes quirks of historical data instead of learning deployable patterns — classic overfitting.

Automation Analogy: That’s like testing Terraform only in dev and assuming prod will behave identically.

The Correct Data Separation Model (Training ≠ Validation ≠ Test)

So, to overcome overfitting and underfitting, we split the data into three sets:

  • Training = learn patterns
  • Validation = tune hyperparameters and compare models
  • Test = final verification before production

Automation Analogy:

  • Training set = Development environment (build & iterate)
  • Validation set = Staging environment (evaluate candidates)
  • Test set = Production smoke test (touch once)
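A minimal sketch of the 70/15/15 split, assuming scikit-learn (the 10,000 records are a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10_000).reshape(-1, 1)   # stand-in for 10,000 deployment records
y = np.zeros(10_000)                   # stand-in labels

# First carve out the 70% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)
# ...then split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # → 7000 1500 1500
```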

Why Two Sets Are Not Enough (And How Teams Break This)

A common mistake:

Train → Test → Pick best → Ship

If you evaluate 12 models on the test set and pick the best one, you’ve leaked information from production into engineering decisions.

Failure mode: Your test set becomes a hidden validation set — and your production performance collapses.

Automation Analogy: That’s like running experiments directly in prod and picking the config that didn’t break — not engineering, just luck.

Real Workflow: Hyperparameter Tuning in Practice

Let’s walk through a realistic workflow for model selection and validation:

graph TB
        subgraph DATA["Deployment History Dataset"]
                All["All Records<br/>(10,000)"]
                Train["Training Set<br/>70% (7,000)<br/>🔨 Learn Patterns"]
                Val["Validation Set<br/>15% (1,500)<br/>🧪 Tune & Compare"]
                Test["Test Set<br/>15% (1,500)<br/>🚦 Final Verification"]
        end

        All --> Train
        All --> Val
        All --> Test

        Train --> M1["Model A"]
        Train --> M2["Model B"]
        Train --> M3["Model C"]

        M1 --> Val
        M2 --> Val
        M3 --> Val

        Val --> Winner["Best Candidate"]
        Winner --> Test

Workflow Steps:

  • Split data into training, validation, and test sets
  • Train multiple candidate models
  • Evaluate on validation set and select the best
  • Retrain winner and test once on the test set

(For a full end-to-end deployment risk example, see Section 6.)

Key Insight: Validation chooses the model. Test verifies generalization. Mixing these destroys trust in metrics.
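The workflow above can be sketched end to end: three candidate model families, the validation set picks the winner, and the test set is touched exactly once. This assumes scikit-learn; the models and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset, split 70/15/15 into train / validation / test.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {
    "Model A": LogisticRegression(max_iter=1000),
    "Model B": DecisionTreeClassifier(random_state=0),
    "Model C": RandomForestClassifier(random_state=0),
}
# Validation chooses the model...
val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
              for name, m in candidates.items()}
winner = max(val_scores, key=val_scores.get)
# ...and the test set is used exactly once, on the winner only.
test_score = candidates[winner].score(X_test, y_test)
print(winner, round(test_score, 2))
```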

Cross-Validation (When Validation Data Is Too Small)

Sometimes you don’t have enough validation data to trust a single split — especially with small datasets.

Cross-validation solves this by rotating validation folds.

flowchart TD
        Data["Training Data (2,000 Samples)<br/>Split into 5 Folds"]

        subgraph I1["Iteration 1"]
                T1["Train: Folds 2–5"]
                V1["Validate: Fold 1<br/>Score: 87%"]
                T1 --> V1
        end

        subgraph I2["Iteration 2"]
                T2["Train: Folds 1,3–5"]
                V2["Validate: Fold 2<br/>Score: 89%"]
                T2 --> V2
        end

        subgraph I3["Iteration 3"]
                T3["Train: Folds 1,2,4,5"]
                V3["Validate: Fold 3<br/>Score: 85%"]
                T3 --> V3
        end

        subgraph I4["Iteration 4"]
                T4["Train: Folds 1–3,5"]
                V4["Validate: Fold 4<br/>Score: 88%"]
                T4 --> V4
        end

        subgraph I5["Iteration 5"]
                T5["Train: Folds 1–4"]
                V5["Validate: Fold 5<br/>Score: 86%"]
                T5 --> V5
        end

        Data --> I1 & I2 & I3 & I4 & I5
        V1 & V2 & V3 & V4 & V5 --> Avg["Average Score: 87%"]

Instead of trusting one validation split, you average performance across 5 independent tests.
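A sketch of 5-fold cross-validation, assuming scikit-learn (the data and model are illustrative); `cross_val_score` handles the fold rotation shown in the diagram:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 2,000-sample training data above.
X, y = make_classification(n_samples=2000, random_state=1)
model = DecisionTreeClassifier(random_state=1)

# Five train/validate rotations, one score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(len(scores), round(scores.mean(), 2))
```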

Automation Analogy: Instead of testing in one staging environment, you test in several:

staging-us-east, staging-us-west, staging-eu, staging-ap. If it passes everywhere, you trust it.

Tradeoffs of Cross-Validation

  • Benefit: Much more reliable performance estimate
  • Cost: ~5× training time

Use when:

  • Dataset is small
  • Production risk is high
  • Algorithm selection matters

Avoid when:

  • Dataset is huge
  • Training cost is extreme
  • You already have stable validation splits

There Is No Universal Best Model (And Why That Matters Operationally)

Warning: There is no globally best ML algorithm. Some datasets favor trees, others neural networks, others linear models. You cannot know in advance — only empirical validation reveals the right choice.

Automation Analogy:

  • Terraform ≠ best for everything
  • Kubernetes ≠ best for everything
  • Ansible ≠ best for everything

The same is true for ML models.

Quick Summary

  • Training is an iterative feedback loop: predict → measure → improve
  • Hyperparameters tune learning behavior, not learned weights
  • Validation is for model selection; test is for final verification
  • Cross-validation improves reliability when data is scarce
  • There is no universally best algorithm — empirical testing wins

2. Inference: The Execution Phase

Model Inference: How Models Make Predictions

Inference is when you use the trained model to make predictions on new, unseen data. This is the execution phase—like running your automation in production with real inputs.

flowchart LR
        Input["New Deployment<br/>Request"]
        Features["Extract Features<br/>files_changed: 200<br/>environment: prod<br/>time: 14:00<br/>team: Platform"]
        Load["Load Trained<br/>Model"]
        Predict["Apply Learned<br/>Patterns"]
        Output["Prediction<br/>Risk: High<br/>Confidence: 91%<br/>Action: Require Approval"]
        Input --> Features --> Load --> Predict --> Output
        style Input fill:#e1f5ff,stroke:#0066cc
        style Output fill:#ffe1f5,stroke:#cc0066
        style Load fill:#FFD700,stroke:#B8860B
        style Predict fill:#90EE90,stroke:#006400

Automation Analogy: Inference is like running terraform apply with new variables—same logic, different inputs, real-world impact.

Example:

# Training = Writing this config
resource "aws_instance" "web" {
  instance_type = var.instance_type
}

# Inference = Running it with new values
terraform apply -var="instance_type=t3.large"
terraform apply -var="instance_type=t3.small"
terraform apply -var="instance_type=t3.medium"

How Inference Works (Step-by-Step)

  1. Load trained model from artifact: read model.pkl and load the learned weights into memory
  2. Receive new input (features only): files_changed: 150, environment: “prod”, team: “Platform” — no label, because the label is what you’re trying to predict
  3. Apply learned logic: the model runs the features through its trained weights and computes a prediction
  4. Return prediction: risk_level: “High”, confidence: 92% — fast (milliseconds)

For a full end-to-end inference workflow in context, see Section 6: Running Example.
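The inference path can be sketched like this. The feature values, model choice, and model.pkl file are illustrative; the “training side” is inlined only to make the sketch self-contained (it assumes scikit-learn):

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# -- training side (offline; shown only so this sketch runs on its own) --
X_train = [[10, 0], [20, 0], [150, 1], [300, 1]]   # features: files_changed, is_prod
y_train = ["Low", "Low", "High", "High"]           # labels
artifact = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(artifact, "wb") as f:
    pickle.dump(DecisionTreeClassifier(random_state=0).fit(X_train, y_train), f)

# -- inference side (online; runs per request) --
with open(artifact, "rb") as f:            # 1. load the trained model artifact
    model = pickle.load(f)
new_request = [[200, 1]]                   # 2. new input: features only, no label
risk = str(model.predict(new_request)[0])  # 3-4. apply learned patterns, return prediction
print(risk)                                # → High
```

Note that inference never sees a label — it produces one.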


What Goes Into Inference (Inputs)

  • Trained model: The artifact from training
  • New data: Features only (no labels)

Automation Analogy: Trained model = compiled plan
New data = input variables at runtime

What Comes Out of Inference (Outputs)

  • Prediction: The predicted label (“High”, “Medium”, “Low”)
  • Confidence score: How confident the model is (optional)
  • Metadata: Model version used, timestamp (for auditing)

Automation Analogy: Prediction = output values
Confidence = test coverage or monitoring
Metadata = deployment logs


Key Differences: Training vs Inference

  • Training: Expensive, slow, happens offline, needs lots of data
  • Inference: Fast, lightweight, happens in real time, uses trained model

Tip: Optimize inference for speed and reliability—this is what runs in production!


Common Pitfalls in Inference

  • Using outdated models (not retrained)
  • Feeding in data with missing or unexpected features
  • Not monitoring prediction quality in production

Warning: Inference is only as good as the data and model you deploy. Always monitor and retrain as needed.


3. One-Time vs Continuous Learning

Batch Training: Learning from Historical Data Once

Most ML systems train once (or periodically) and then serve predictions continuously. This is like building your automation once and running it many times.

Example Workflow:

Monday:     Train model on 6 months of historical data
            └─ Takes 2 hours, produces model_v1.pkl

Tuesday-Sunday: Use model_v1.pkl for all predictions
                └─ Each prediction takes 50ms

Next Monday: Retrain on updated data (new week added)
             └─ Produces model_v2.pkl

Automation Analogy: You write your Terraform config once, then use it repeatedly for different environments.

# Write config once (training)
vim main.tf

# Apply many times (inference)
terraform apply -var="env=dev"
terraform apply -var="env=staging"
terraform apply -var="env=prod"

Online Learning: Continuous Model Retraining in Production

Some systems retrain continuously as new data arrives. This is less common and more complex, but essential for fast-changing environments.

Example Workflow:

Every hour:
├─ Collect new deployment data
├─ Add to training dataset
├─ Retrain model
└─ Deploy updated model

Automation Analogy: Like GitOps where every commit triggers a new deployment:
Code change → CI/CD pipeline → Auto-deploy
New data → Retrain model → Auto-deploy new model
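A sketch of the incremental update step, assuming scikit-learn: `partial_fit` updates the model batch by batch instead of retraining from scratch (the data, labels, and batch sizes are made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])                 # all classes must be declared on the first call

rng = np.random.default_rng(0)
for hour in range(5):                      # e.g. one new batch of data per hour
    X_new = rng.normal(size=(50, 3))
    y_new = (X_new[:, 0] > 0).astype(int)  # toy rule: label depends on feature 0
    model.partial_fit(X_new, y_new, classes=classes)  # incremental update, no full retrain

pred = int(model.predict([[2.0, 0.0, 0.0]])[0])
print(pred)
```

Only estimators that support incremental learning expose `partial_fit`; most tree ensembles do not, which is one reason online learning constrains your model choices.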


When to Use Online Learning

Best for:

  • Fraud detection (patterns change rapidly)
  • Real-time recommendation engines
  • Stock price prediction

Usually not needed for:

  • Stable business processes
  • Most enterprise automation

Tip: Start with batch training. Move to online learning only if your data or requirements change rapidly.


4. When Should You Retrain Your Model? (Triggers & Schedules)

Your Terraform config doesn’t change unless requirements change. Models are similar—but with a catch: production data and requirements evolve!

Visual: Retraining Decision Flow
flowchart TD
    Monitor["Production<br/>Monitoring"]
    
    Check1{"Data Drift<br/>Detected?"}
    Check2{"Concept Drift<br/>Detected?"}
    Check3{"Accuracy<br/>< 85%?"}
    Check4{"New Features<br/>Available?"}
    
    Action["Trigger<br/>Retraining"]
    Collect["Collect New<br/>Training Data"]
    Retrain["Train New<br/>Model Version"]
    Validate["Validate on<br/>Test Set"]
    Deploy["Deploy New<br/>Model"]
    Continue["Continue<br/>Monitoring"]
    
    Monitor --> Check1 & Check2 & Check3 & Check4
    Check1 -->|Yes| Action
    Check2 -->|Yes| Action
    Check3 -->|Yes| Action
    Check4 -->|Yes| Action
    Check1 & Check2 & Check3 & Check4 -->|No| Continue
    
    Action --> Collect --> Retrain --> Validate --> Deploy --> Monitor
    
    style Monitor fill:#e1f5ff,stroke:#0066cc
    style Action fill:#FFD700,stroke:#B8860B
    style Deploy fill:#90EE90,stroke:#006400

Why Retrain? Four Key Triggers

1️⃣ Data Drift: Has Your Data Changed?

What it is: The real-world data distribution changes over time.

Example:

Original training data (2024):
├─ Average files_changed: 50
├─ Most deployments: 9am-5pm
└─ Teams: Platform, App, Data

New reality (2026):
├─ Average files_changed: 200 (microservices explosion)
├─ Most deployments: 24/7 (automation increased)
└─ Teams: Platform, App, Data, ML, Security, Compliance

The model was trained on old patterns. It won’t predict well on new patterns.

Automation Analogy: Requirements changed, so you update your Terraform config.
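A toy drift check might compare the live mean of a feature to its training-time mean; the 50% tolerance here is an arbitrary illustrative threshold, not a recommendation:

```python
def drift_detected(train_mean: float, live_mean: float, tolerance: float = 0.5) -> bool:
    """Flag drift when the live mean has shifted more than `tolerance` (here, 50%)."""
    return abs(live_mean - train_mean) / train_mean > tolerance

# Training data (2024): average files_changed was 50.
# Live data (2026): average files_changed is 200.
print(drift_detected(50, 200))   # → True: distribution shifted, consider retraining
print(drift_detected(50, 60))    # → False: within tolerance
```

Real drift detection usually compares full distributions (e.g. with statistical tests), but the principle is the same: the live data no longer looks like the training data.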

2️⃣ Concept Drift: Has the Meaning Changed?

What it is: The relationship between features and labels changes.

Example:

2024: Deployments to prod at night = High risk
        (Only emergencies happened at night)

2026: Deployments to prod at night = Low risk
        (Automated deployments now happen at 2am during low traffic)

The meaning of “night deployment” changed. The model needs to relearn.

Automation Analogy: Business logic changed, so you rewrite your validation rules.

3️⃣ New Features: Is New Data Available?

What it is: You have new data sources that could improve predictions.

Example:

Old features:
├─ files_changed
├─ environment
└─ team

New features available:
├─ code_review_approval_count
├─ test_coverage_percentage
├─ security_scan_results
└─ infrastructure_cost_impact

Retraining with these new features might improve accuracy.

Automation Analogy: New monitoring tools available, so you add health checks to your config.

4️⃣ Poor Performance: Is Accuracy Dropping?

What it is: Model accuracy degrades below acceptable thresholds.

Monitoring metrics:

Week 1: Accuracy 95% ✅
Week 4: Accuracy 92% ✅
Week 8: Accuracy 85% ⚠️
Week 12: Accuracy 78% ❌ Time to retrain

Automation Analogy: Your pipeline success rate drops, so you investigate and fix issues.
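That trigger can be expressed as a simple threshold check over monitored accuracy; the numbers mirror the weekly metrics above:

```python
THRESHOLD = 0.85   # minimum acceptable accuracy

# Weekly accuracy from production monitoring (week number -> accuracy).
weekly_accuracy = {1: 0.95, 4: 0.92, 8: 0.85, 12: 0.78}

# Flag every week whose accuracy fell below the threshold.
triggered = [week for week, acc in weekly_accuracy.items() if acc < THRESHOLD]
print(triggered)   # → [12]: week 8 sits exactly at the threshold, week 12 is below it
```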


Retraining Schedules: When Should You Trigger It?

| Approach | When to Use | Example |
| --- | --- | --- |
| Time-based | Regular cadence | Retrain every Monday |
| Performance-based | When metrics degrade | Retrain if accuracy < 85% |
| Data-based | When enough new data collected | Retrain after 1000 new examples |
| Event-based | On significant changes | Retrain after major platform update |

Engineer Insight: Start with time-based (weekly or monthly) and add performance monitoring for production systems.
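Combining the four triggers, the retraining decision can be sketched as a single predicate; the signal names and the 85% floor are hypothetical:

```python
def should_retrain(data_drift: bool, concept_drift: bool,
                   accuracy: float, new_features: bool,
                   min_accuracy: float = 0.85) -> bool:
    """Any single trigger is enough to schedule a retrain."""
    return data_drift or concept_drift or new_features or accuracy < min_accuracy

print(should_retrain(False, False, 0.78, False))  # → True (accuracy trigger fired)
print(should_retrain(False, False, 0.93, False))  # → False (all signals healthy)
```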


5. ML vs Automation: Terraform, DevOps, and Ansible Equivalents

| ML Concept | Description | Terraform Equivalent | DevOps/CI/CD Equivalent | Ansible Equivalent |
| --- | --- | --- | --- | --- |
| Training data | What you learn from | Requirements docs + past incidents | User stories, incident reports | Inventory, playbook vars |
| Training | Build phase | Writing and testing config | Build/test pipeline | Writing and testing playbooks |
| Trained model | Artifact with logic | main.tf (final config) | Final pipeline YAML | Final playbook |
| Model artifact | Versioned file | terraform.tfstate | Build artifact, release | Facts cache, output vars |
| Inference | Execution phase | terraform apply | Pipeline run/deploy | ansible-playbook run |
| Features | What you provide | Input variables | Pipeline parameters | Extra vars, host vars |
| Prediction | What you get back | Outputs | Pipeline output, deployment | Playbook output, registered vars |
| Retraining | Keeping logic current | Updating config for new requirements | Pipeline refactor, new release | Playbook update for new logic |
| Data drift | Need to update config | Requirements changed | New business rules, env change | Inventory/vars change |
| Batch inference | One-off prediction run | terraform apply once | One pipeline run | One playbook run |
| Real-time inference | Continuous predictions | GitOps auto-apply on commit | Auto-deploy on commit | Ansible Tower/AWX auto-run |

6. Running Example: Deployment Risk Assessment

Full Running Example: Deployment Risk Assessment

Training Phase (One-Time or Periodic)

  1. Collect historical deployment data (10,000 past deployments)
    • Features: files_changed, environment, team, time_of_day, previous_failures
    • Labels: actual_risk_level (Low/Medium/High)
  2. Split data: 70% training, 15% validation, 15% test
  3. Train model (e.g., Decision Tree)
    • Learns patterns like: “IF files_changed > 100 AND environment == ‘prod’ THEN High Risk”
  4. Validate model: Accuracy 89% on validation set
  5. Save trained model: deployment_risk_model_v1.pkl

Inference Phase (Every Deployment)

  1. New deployment request arrives (e.g., files_changed: 200, environment: “prod”)
  2. Load trained model (deployment_risk_model_v1.pkl)
  3. Make prediction (model applies learned patterns, takes 20ms)
  4. Return prediction (risk_level: “High”, confidence: 91%, recommendation: “Require approval from senior engineer”)
  5. Decision system uses prediction (deployment paused, approval required)

Retraining Trigger (After 1 Month)

Monitoring detects:

  • Accuracy dropped to 82% (was 89%)
  • 2,000 new deployments collected
  • Data drift detected (more microservices deployments now)

Action:

  • Retrain model on updated dataset (12,000 deployments now)
  • New accuracy: 90%
  • Deploy deployment_risk_model_v2.pkl

Engineer Insight: This end-to-end flow shows how ML models are built, deployed, and kept current in real-world automation. Monitor, retrain, and always close the loop between predictions and outcomes!


7. Practical Implications for Production

Machine learning in production is not just about code—it’s about infrastructure, resource planning, and operational discipline. Here’s what you need to know to run ML reliably at scale.

Training Environment: What’s Needed?

Training:

  • Happens offline (doesn’t block production)
  • Resource-intensive (use dedicated compute)
  • Versioned and tracked (like code releases)
  • Requires validation before deployment

Typical setup:

├─ High-end GPU/CPU instances (intermittent use)
├─ Large storage for training data
├─ Batch processing framework (Spark, Airflow)
└─ Experiment tracking (MLflow, Weights & Biases)

Inference Environment: What’s Needed?

Inference:

  • Happens in real-time (latency matters)
  • Lightweight (optimized for speed)
  • Uses production infrastructure
  • Needs monitoring and fallback logic

Typical setup:

├─ Lower-spec instances (always-on)
├─ Model registry for versioned artifacts
├─ API gateway for predictions
└─ Monitoring and alerting

Automation Analogy: The difference between training and inference environments in ML is just like the distinction between CI/CD build agents and production servers in automation.

Training = CI/CD build agents (bursty, powerful)
Inference = Production application servers (steady, efficient)

Practitioner’s Lesson: Plan for both training and inference as first-class citizens in your infrastructure. Don’t let your production ML fail because you only optimized for one phase!


What I Wish I Knew Earlier

Practitioner’s Lessons:

  • Training ≠ Inference: Training learns patterns from historical data. Inference applies learned patterns to new data.
  • Training is build time, inference is runtime: Like writing Terraform config vs running terraform apply.
  • Training is expensive, inference is cheap: Train once (or periodically), predict millions of times.
  • Models are artifacts: Trained models are files that need versioning, deployment, and monitoring.
  • Retraining is necessary: Data drift, concept drift, and new features require periodic retraining.
  • Monitor model performance in production: Track accuracy, latency, and prediction distribution to know when to retrain.
  • Batch vs online learning: Most systems use batch (periodic) training. Online learning is complex and rarely needed.

What’s Next?

Series 3 – Chapter 3.0: ML Project Workflow

In the next chapter, we’ll explore:

  • The complete ML project workflow from problem to production
  • The seven phases of ML projects
  • How all the concepts we’ve learned fit together
  • End-to-end example mapping to automation workflows

Architectural Question: How do the phases of an ML project map to automation workflows, and what are the key checkpoints for success?

We’ve covered the fundamentals—data, features, labels, training, and inference. Now we’ll see how to orchestrate these pieces into a real ML project.


This post is licensed under CC BY 4.0 by the author.

© 2026 Ravi Joshi. Some rights reserved. Except where otherwise noted, the blog posts on this site are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.