
Chapter 3.4 – Feature Engineering: The Real Work Behind ML

Feature engineering techniques explained for automation engineers. Where 80% of ML work actually happens. Part of the 'From Automation to AI' series.


The Hidden Truth About Machine Learning


Key Insight: 80% of the work in machine learning isn’t choosing algorithms or tuning hyperparameters—it’s preparing and transforming data into useful features.

In automation, we know this intuitively:

  • Writing a good Terraform module isn’t just about resource blocks
  • It’s about input validation, variable transformation, conditional logic, and computed values
  • The resource blocks are 20% of the effort; variable design and data flow are the other 80%

Feature engineering is the ML equivalent: transforming raw data into inputs that help the model learn patterns effectively.

We touched on this in Chapter 2.1, but now we’ll dive deeper into the specific techniques that make or break real ML projects.


Why Feature Engineering Matters More Than Algorithms

I tested this with our deployment risk model:

Experiment 1: Simple features + Random Forest

  • Features: files_changed, hour, day_of_week
  • Accuracy: 72%

Experiment 2: Engineered features + Decision Tree

  • Features: is_large_change, is_business_hours, recent_failure_rate, team_reliability_score
  • Accuracy: 86%

The takeaway: Better features with a simpler algorithm beat weak features with a complex algorithm.

Automation analogy:

  • Good input validation + simple bash script > No validation + complex Python
  • Well-structured Terraform variables + standard modules > Messy variables + custom code

The Feature Engineering Workflow

```mermaid
flowchart TD
    Start([Raw Data])
    
    Understand[1. Understand<br/>the Problem]
    Explore[2. Explore<br/>Raw Data]
    Create[3. Create<br/>New Features]
    Select[4. Select<br/>Best Features]
    Validate[5. Validate<br/>Impact]
    
    Model{Model<br/>Performance<br/>Good?}
    
    Deploy[Deploy to<br/>Production]
    Iterate[Iterate:<br/>New Ideas]
    
    Start --> Understand
    Understand --> Explore
    Explore --> Create
    Create --> Select
    Select --> Validate
    Validate --> Model
    
    Model -->|Yes| Deploy
    Model -->|No| Iterate
    Iterate --> Create
    
    style Start fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Deploy fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
    style Model fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style Iterate fill:#ffebee,stroke:#d32f2f,stroke-width:2px
```

This iterative loop mirrors workflows we already know:

  • Terraform development: Plan → Apply → Validate → Refactor
  • CI/CD pipeline tuning: Build → Test → Analyze → Optimize

Technique 1: Encoding Categorical Variables

The Problem

ML algorithms work with numbers, not categories. But much of our data is categorical:

  • Environment: dev, staging, prod
  • Team: Platform, API, Frontend
  • Deployment type: hotfix, feature, rollback

Solution: One-Hot Encoding

Automation analogy: Converting string variables into boolean flags.

```hcl
# Terraform-style thinking
variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

locals {
  is_dev     = var.environment == "dev"
  is_staging = var.environment == "staging"
  is_prod    = var.environment == "prod"
}
```

ML implementation:

| Original | is_dev | is_staging | is_prod |
|----------|--------|------------|---------|
| dev      | 1      | 0          | 0       |
| staging  | 0      | 1          | 0       |
| prod     | 0      | 0          | 1       |
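
In pandas this is a one-liner. A minimal sketch (the DataFrame and column names here are illustrative, not from the actual risk model):

```python
import pandas as pd

# Hypothetical deployment records
df = pd.DataFrame({"environment": ["dev", "staging", "prod"]})

# One boolean column per category, cast to 0/1
encoded = pd.get_dummies(df["environment"], prefix="is").astype(int)
print(encoded)  # columns (alphabetical): is_dev, is_prod, is_staging
```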

When to Use

  • Use one-hot encoding when:
    • Categories have no inherent order (dev isn’t “less than” prod)
    • You have fewer than 10-15 categories
  • Alternatives for many categories:
    • Target encoding: Replace category with average outcome
    • Embedding: Let the model learn category representations (deep learning)

Example: Deployment Risk

```text
Raw feature:
- deployment_type: "hotfix"

One-hot encoded:
- is_hotfix: 1
- is_feature: 0
- is_rollback: 0
```


> **Why this helps:** The model learns: hotfix → higher risk. Without encoding, "hotfix" is just a string (meaningless to the algorithm).
{: .prompt-tip }

Technique 2: Feature Scaling and Normalization

The Problem

Features with different scales can dominate the model:

| Deployment | files_changed | duration_minutes | team_size |
|------------|---------------|------------------|-----------|
| 1          | 500           | 15               | 8         |
| 2          | 50            | 120              | 12        |

Issue: files_changed (0-1000) dominates team_size (5-20)

Solution: Normalize to Similar Scales

Automation analogy: Converting resource counts to percentages of capacity.

```python
# Like monitoring thresholds: convert raw values to percentages of capacity
current_cpu, max_cpu = 6.0, 8.0          # illustrative values
current_memory, max_memory = 24.0, 32.0

cpu_usage_percent = (current_cpu / max_cpu) * 100
memory_usage_percent = (current_memory / max_memory) * 100

# Both now on a 0-100 scale, comparable
```

Common Scaling Methods

1. Min-Max Scaling (0 to 1)

```text
scaled_value = (value - min) / (max - min)

Example:
files_changed: 500 → (500 - 0) / (1000 - 0) = 0.5
duration: 120 → (120 - 5) / (300 - 5) = 0.39
```

2. Standardization (Z-score)

```text
scaled_value = (value - mean) / standard_deviation

Example:
files_changed: 500 → (500 - 200) / 150 = 2.0
(500 is 2 standard deviations above the mean)
```
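
Both methods are single calls in scikit-learn. A minimal sketch, assuming a made-up feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical columns: files_changed, duration_minutes
X = np.array([[500.0, 15.0],
              [50.0, 120.0],
              [200.0, 45.0]])

min_max = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
z_score = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
```

One practical note: fit the scaler on training data only and reuse it at prediction time; fitting on the full dataset leaks test-set information into training.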

When to Use

  • Min-Max: When you need values in a specific range (0-1, 0-100)
  • Standardization: When you want to preserve outlier information
  • No scaling: For tree-based models (Decision Trees, Random Forest) - they don’t care about scale

Technique 3: Binning and Discretization

The Problem

Sometimes continuous numbers are too granular. Grouping them into ranges can reveal patterns.

Solution: Create Bins (Buckets)

Automation analogy: Like alert severity levels.

```python
# Monitoring thresholds
cpu = 82  # current CPU utilisation (%), illustrative value

if cpu < 70:
    severity = "normal"
elif cpu < 85:
    severity = "warning"
else:
    severity = "critical"
```

ML application:

| files_changed | change_size_category |
|---------------|----------------------|
| 5             | small                |
| 50            | medium               |
| 500           | large                |

Binning Strategies

1. Equal-Width Bins

```text
Small: 0-100 files
Medium: 101-500 files
Large: 501+ files
```

2. Equal-Frequency Bins (Quantiles)

```text
Small: Bottom 33% of values
Medium: Middle 33%
Large: Top 33%
```

3. Domain-Driven Bins

```text
Based on your expertise:
Small: 0-50 (safe deployments)
Medium: 51-200 (review required)
Large: 201+ (high risk)
```
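
In pandas, domain-driven bins map directly onto pd.cut. A sketch using the thresholds above:

```python
import pandas as pd

files_changed = pd.Series([5, 50, 150, 500])

# Domain-driven thresholds: 0-50 safe, 51-200 review, 201+ high risk
change_size = pd.cut(
    files_changed,
    bins=[0, 50, 200, float("inf")],
    labels=["small", "medium", "large"],
)
# → small, small, medium, large
```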

When to Use

  • Use binning when:
    • You have domain knowledge about meaningful thresholds
    • The relationship is step-wise, not smooth
    • You want interpretable categories
  • Example: Deployment hour → is_business_hours is more useful than raw hour (14 vs 2)

Technique 4: Creating Interaction Features

The Problem

Sometimes the combination of features matters more than individual features.

The Insight

In automation, we know this:

```hcl
# Terraform: Risk isn't just one variable
locals {
  high_risk = (
    var.environment == "prod" &&
    var.change_size == "large" &&
    var.time_window == "business_hours"
  )
}
```

ML equivalent: Create features that capture these interactions.

Example: Deployment Risk

Individual features:

  • is_prod: 1
  • is_large_change: 1
  • is_business_hours: 1

Interaction feature:

  • prod_AND_large_AND_business_hours: 1

Why this helps: The model learns “this specific combination is risky” more easily.
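
Building the interaction is a single boolean expression. A sketch with pandas (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "is_prod":           [1, 1, 0],
    "is_large_change":   [1, 0, 1],
    "is_business_hours": [1, 1, 1],
})

# 1 only when all three risk signals coincide
df["prod_AND_large_AND_business_hours"] = (
    df["is_prod"] & df["is_large_change"] & df["is_business_hours"]
)
# → 1, 0, 0
```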

Common Interactions

```mermaid
flowchart LR
    F1[Feature 1:<br/>Environment]
    F2[Feature 2:<br/>Change Size]
    F3[Feature 3:<br/>Time]
    
    I1[Interaction:<br/>env_size]
    I2[Interaction:<br/>env_time]
    I3[Interaction:<br/>size_time]
    I4[Interaction:<br/>env_size_time]
    
    F1 --> I1
    F2 --> I1
    
    F1 --> I2
    F3 --> I2
    
    F2 --> I3
    F3 --> I3
    
    F1 --> I4
    F2 --> I4
    F3 --> I4
    
    style F1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style F2 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style F3 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style I4 fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
```

When to Use

  • Use interaction features when:
    • Domain knowledge suggests combinations matter
    • Individual features alone don’t predict well
    • You have tree-based models (they can learn interactions, but explicit features help)
  • Avoid when:
    • You have too many features (combinatorial explosion)
    • Using neural networks (they learn interactions automatically)

Technique 5: Time-Based Features

The Problem

Timestamps are hard for models to use directly: to an algorithm, 2026-01-07 14:32:15 is just an opaque value (at best, one enormous epoch number).

Solution: Extract Meaningful Time Components

Automation analogy: Parsing timestamps for scheduling logic.

```python
# Like cron expressions or monitoring windows
from datetime import datetime

dt = datetime.now()
is_weekend = dt.weekday() >= 5
is_business_hours = 9 <= dt.hour < 17
is_end_of_month = dt.day > 25
```

Useful Time Features

From timestamp 2026-01-07 14:32:15:

| Extracted Feature | Value | Why It Matters |
|-------------------|-------|----------------|
| hour | 14 | Deployment time affects risk |
| day_of_week | 2 (Wednesday) | Weekday patterns |
| is_weekend | 0 | Weekend deployments riskier |
| is_business_hours | 1 | Business hours = more users affected |
| week_of_month | 1 | End of month rush? |
| is_end_of_sprint | 1 | Sprint deadlines = rushed deploys |
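
These extractions are one-liners in pandas. A sketch for the timestamp above:

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2026-01-07 14:32:15"]))

time_features = pd.DataFrame({
    "hour": ts.dt.hour,                                          # 14
    "day_of_week": ts.dt.dayofweek,                              # 2 (Monday = 0)
    "is_weekend": (ts.dt.dayofweek >= 5).astype(int),            # 0
    "is_business_hours": ts.dt.hour.between(9, 16).astype(int),  # 1
})
```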

Advanced: Lag and Rolling Features

Automation analogy: Like monitoring trends, not just current values.

```text
Current deployment:
- current_cpu: 75%

Historical features:
- avg_cpu_last_7_days: 65%
- max_cpu_last_24h: 85%
- cpu_trend: increasing

These show context, not just a snapshot
```

ML application:

| Feature | Description | Example |
|---------|-------------|---------|
| deployments_last_7_days | Recent activity | 12 |
| failure_rate_last_30_days | Historical failure rate | 0.15 |
| time_since_last_failure | Recency of issues | 5 days |
| avg_duration_last_10 | Typical duration | 22 min |
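
This is where pandas' shift and rolling earn their keep. A sketch (window size and history are made up); note the shift(1), which computes each row's feature from strictly earlier deployments:

```python
import pandas as pd

# Hypothetical per-deployment history, ordered by time (1 = failed)
history = pd.DataFrame({"failed": [0, 0, 1, 0, 1, 0, 0, 1]})

# Failure rate over the previous 4 deployments.
# shift(1) excludes the current row: only past information, no leakage.
history["recent_failure_rate"] = (
    history["failed"].shift(1).rolling(window=4, min_periods=1).mean()
)
```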

When to Use

  • Use time features when:
    • Temporal patterns exist (time of day, day of week)
    • Historical context matters
    • Trends indicate risk (increasing failures, accelerating changes)

Technique 6: Aggregation and Summary Statistics

The Problem

Raw lists or sequences are hard to use: previous_deployments: [15, 22, 18, 30, 12]

Solution: Compute Summary Statistics

Automation analogy: Like log aggregation for metrics.

```python
import numpy as np

# Monitoring aggregation
response_times = [120, 135, 118, 450, 125]

metrics = {
    "avg": np.mean(response_times),            # 189.6
    "p95": np.percentile(response_times, 95),  # 387.0, pulled up by the 450 outlier
    "max": max(response_times),                # 450
    "std_dev": np.std(response_times),         # ~130; high variance = inconsistent
}
```

ML application:

From [15, 22, 18, 30, 12] create:

  • deployment_avg_duration: 19.4
  • deployment_max_duration: 30
  • deployment_std_dev: 6.99
  • deployment_trend: increasing/decreasing
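
A numpy sketch of the same computation (the trend here is the slope of a linear fit, one simple definition among several):

```python
import numpy as np

durations = np.array([15, 22, 18, 30, 12])

features = {
    "deployment_avg_duration": durations.mean(),    # 19.4
    "deployment_max_duration": durations.max(),     # 30
    "deployment_std_dev": durations.std(ddof=1),    # ~6.99 (sample std dev)
    # Positive slope = durations trending up
    "deployment_trend": np.polyfit(np.arange(len(durations)), durations, 1)[0],
}
```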

Common Aggregations

| Metric | What It Captures |
|--------|------------------|
| Mean | Typical value |
| Median | Central value (robust to outliers) |
| Max/Min | Extremes |
| Std Dev | Variability/consistency |
| Count | Frequency |
| Sum | Total |

When to Use

  • Use aggregations when:
    • You have historical sequences
    • Patterns exist in trends (not just current value)
    • Variability matters (consistent vs erratic behavior)

Feature Selection: Choosing What Matters

After creating features, you often have too many. Feature selection removes the noise.

Why Feature Selection Matters

Problems with too many features:

  • ❌ Overfitting (model memorizes noise)
  • ❌ Slower training
  • ❌ Harder to interpret
  • ❌ More storage and computation

Automation analogy:

  • Too many Terraform variables = harder to use modules
  • Too many monitoring metrics = alert fatigue
  • Keep what adds value, remove the rest

Method 1: Correlation Analysis

Remove redundant features:

```text
files_changed: 150
lines_changed: 3000

Correlation: 0.95 (highly correlated)
→ Keep one, drop the other
```
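
Checking this in pandas takes one line. The values below are invented to show a near-perfect correlation:

```python
import pandas as pd

df = pd.DataFrame({
    "files_changed": [10, 80, 150, 300],
    "lines_changed": [200, 1600, 3000, 6100],
})

corr = df["files_changed"].corr(df["lines_changed"])
print(f"correlation: {corr:.2f}")  # ~1.00 → redundant pair, keep one column
```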

Method 2: Feature Importance (Tree-Based)

Let the model tell you what matters:

```text
Random Forest feature importance:
1. recent_failure_rate: 0.35
2. is_prod: 0.25
3. change_size_category: 0.20
4. is_business_hours: 0.12
5. team_size: 0.08

→ Drop team_size (low importance)
```
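
A minimal scikit-learn sketch. The data is synthetic and the feature names are the hypothetical ones from this chapter, so the printed numbers will differ from the ranking above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for deployment data (illustration only)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
names = ["recent_failure_rate", "is_prod", "change_size_category",
         "is_business_hours", "team_size"]

model = RandomForestClassifier(random_state=42).fit(X, y)
for name, importance in sorted(zip(names, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.2f}")
```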

Method 3: Recursive Feature Elimination

Iteratively remove least important features:

```mermaid
flowchart TD
    Start[Start with all features]
    Train1[Train model]
    Remove[Remove least important]
    Train2[Train again]
    Check{Performance<br/>still good?}
    Done[Final feature set]
    
    Start --> Train1
    Train1 --> Remove
    Remove --> Train2
    Train2 --> Check
    Check -->|Yes| Remove
    Check -->|No| Done
    
    style Start fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Done fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
    style Check fill:#fff3e0,stroke:#f57c00,stroke-width:2px
```
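
scikit-learn packages this loop as RFE. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Repeatedly drop the weakest feature until 4 remain
rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the surviving features
```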

Method 4: Domain Knowledge

Use your expertise:

```text
Created features:
- deployment_size
- files_changed
- lines_added
- lines_deleted
- commit_message_length ← Probably useless
- author_name_length ← Definitely useless

→ Drop the obviously irrelevant ones
```

Practical Example: Deployment Risk (End-to-End)

Step 1: Start with Raw Data

```yaml
deployment_id: 12345
timestamp: 2026-01-07 14:30:00
environment: prod
files_changed: 150
team: Platform
previous_deployments: [Success, Success, Failed, Success]
```

Step 2: Engineer Features

```python
# Categorical encoding
is_prod = 1
is_staging = 0
is_dev = 0

# Binning
change_size = "large"  # (>100 files)

# Time features
hour = 14
is_business_hours = 1
day_of_week = 2
is_weekend = 0

# Historical aggregation
recent_failure_rate = 0.25  # (1/4 recent failures)
deployments_last_7_days = 8

# Interaction
prod_AND_large = 1  # (is_prod AND change_size=="large")
```

Step 3: Select Features

```text
Feature importance ranking:
1. recent_failure_rate: 0.35
2. prod_AND_large: 0.28
3. is_business_hours: 0.18
4. deployments_last_7_days: 0.12
5. day_of_week: 0.07 ← Drop (low importance)

Final features: Top 4
```

Step 4: Result

Before feature engineering:

  • Raw features: timestamp, environment, files_changed
  • Accuracy: 72%

After feature engineering:

  • Engineered features: recent_failure_rate, prod_AND_large, is_business_hours, deployments_last_7_days
  • Accuracy: 88%

Feature Engineering Best Practices

1. Start Simple, Iterate

  • ✅ Begin with obvious features
  • ✅ Add complexity based on results
  • ❌ Don’t create 100 features on day one

Automation analogy: Start with basic Terraform, refine based on needs.


2. Use Domain Knowledge

Your automation expertise is your superpower:

  • You know peak hours matter → Create is_business_hours
  • You know team reliability varies → Create team_success_rate
  • You know change size indicates risk → Create change_size_category

This beats blindly trying transformations.


3. Validate Feature Impact

Test each feature:

```text
Baseline (no feature): 75% accuracy

Add recent_failure_rate: 82% accuracy ✅ Keep it
Add commit_message_length: 75% accuracy ❌ Drop it
Add prod_AND_large: 85% accuracy ✅ Keep it
```
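
This ablation loop is easy to script. A sketch with cross-validation on synthetic data (the feature names are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: three informative columns plus one noise column
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
df = pd.DataFrame(X, columns=["recent_failure_rate", "prod_AND_large",
                              "is_business_hours", "commit_message_length"])

def score(columns):
    model = RandomForestClassifier(random_state=42)
    return cross_val_score(model, df[columns], y, cv=5).mean()

baseline = ["recent_failure_rate", "prod_AND_large"]
print(score(baseline))
print(score(baseline + ["is_business_hours"]))  # keep only if this improves
```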

4. Avoid Data Leakage

Data leakage = Using information that won’t be available at prediction time

Bad example:

```text
# Feature: deployment_duration
# Label: deployment_success

Problem: You don't know duration until AFTER deployment completes
→ Can't use this feature to predict success before deploying
```

Automation analogy: Can’t use the result to predict the result.


5. Document Your Features

Future you (and your team) will thank you:

```yaml
features:
  recent_failure_rate:
    description: "Ratio of failed deployments in last 30 days"
    calculation: "failures / total_deployments (last 30 days)"
    range: "0.0 to 1.0"

  prod_AND_large:
    description: "High risk indicator: production + large change"
    calculation: "is_prod AND change_size=='large'"
    values: "0 or 1"
```

My Feature Engineering Checklist

Before building a model, I now ask:

  1. What domain knowledge can I encode?
    • Time patterns, business rules, known thresholds
  2. Are categorical variables encoded?
    • One-hot encoding, target encoding, or embeddings
  3. Are scales normalized?
    • If using distance-based algorithms (KNN, SVM)
  4. Have I created interaction features?
    • Combinations that matter based on expertise
  5. Have I extracted time components?
    • Hour, day, business hours, trends
  6. Have I aggregated historical data?
    • Means, trends, failure rates
  7. Have I tested feature importance?
    • Drop low-value features
  8. Am I avoiding data leakage?
    • Only use information available at prediction time

Key Takeaways


  • Feature engineering is 80% of ML work: Good features + simple algorithm > Bad features + complex algorithm
  • Your domain expertise is your biggest advantage
  • Common techniques:
    • One-hot encoding: Convert categories to numbers
    • Scaling: Normalize feature ranges
    • Binning: Group continuous values into categories
    • Interactions: Combine features that matter together
    • Time features: Extract meaningful time components
    • Aggregations: Summarize historical patterns
  • Feature selection matters: Remove redundant features, use feature importance rankings, keep what adds value, drop the rest
  • Think like an automation engineer: Feature engineering = data transformation pipelines (like Terraform locals, Ansible filters, data preprocessing). Apply the same rigor you apply to infrastructure code.

What’s Next?

Series 4 – Deep Learning

In the next series, we’ll explore:

  • Why deep learning exists
  • How neural networks work (without the math)
  • When to use deep learning vs traditional ML
  • The connection to LLMs and generative AI

Architectural Question: How does deep learning change the role of feature engineering, and when should you trust a neural network to learn features for you?

The key insight: Deep learning automates feature engineering—the network learns features from raw data instead of you manually creating them.

