Building AI that works with messy data (it's always messy)
Every AI project starts with data that is worse than you think. Here is how I handle missing values, inconsistent formats, and the 80/20 rule of cleaning.

Every AI project I've worked on has started the same way. The client says "we have great data." I pull it into a notebook, run a few basic checks, and discover that "great data" means "it exists."
Missing fields. Inconsistent categories. Date formats that change between rows. Free-text fields where someone typed "N/A", "n/a", "NA", "none", "0", and a blank cell, all meaning the same thing. Duplicate records with slight variations. Values that are obviously wrong but weren't caught because nobody was looking.
This isn't a failure. It's normal. Every business generates messy data because data entry is done by humans in a hurry, across multiple systems, with no standardisation enforcement. The question isn't whether your data is messy. It's how messy, and what to do about it.
Why data quality matters for AI
A machine learning model learns patterns from your data. If the data is inconsistent, the model learns inconsistent patterns. Garbage in, garbage out applies more directly to ML than almost anywhere else in software.
Here's a concrete example. I was building a predictive maintenance model using property management data. The "issue_category" field had been free-typed by staff for 3 years. The same problem appeared as:
- "Boiler repair"
- "Heating issue"
- "Boiler not working"
- "No hot water"
- "Central heating broken"
- "boiler"
- "Heating"
That's 7+ categories for essentially the same problem. When I trained a model to predict future issues by category, it treated each of these as different problems. The predictions were useless because the categories were noise, not signal.
After normalising these into a single "Heating/Boiler" category, the model's accuracy jumped from 58% to 72%. Same data, same model, just cleaner labels.
The three biggest data problems
After doing this across a dozen projects, the same three problems show up almost every time.
1. Inconsistent categorisation
Any field where humans type or select a category will have inconsistencies. Support ticket types, product categories, expense categories, issue descriptions. The more people entering data and the longer they've been doing it, the worse it gets.
How I fix it:
Start with frequency analysis. Pull every unique value and sort by count:
```python
import pandas as pd

df = pd.read_csv("maintenance_tickets.csv")
print(df["category"].value_counts())
```
This usually reveals a long tail: 10-15 categories that cover 80% of records and 50+ categories that cover the remaining 20%. Many of those 50 are duplicates or near-duplicates.
For the obvious duplicates, a mapping dictionary handles it:
```python
CATEGORY_MAP = {
    "boiler repair": "heating_boiler",
    "heating issue": "heating_boiler",
    "boiler not working": "heating_boiler",
    "no hot water": "heating_boiler",
    "central heating broken": "heating_boiler",
    "boiler": "heating_boiler",
    "heating": "heating_boiler",
    # ... more mappings
}

df["category_clean"] = (
    df["category"]
    .str.lower()
    .str.strip()
    .map(CATEGORY_MAP)
    .fillna(df["category"].str.lower().str.strip())
)
```
For the fuzzy matches that aren't obvious, I use rapidfuzz:
```python
from rapidfuzz import process, fuzz

unique_cats = df["category_clean"].unique().tolist()
canonical = ["heating_boiler", "plumbing", "electrical", "structural", "pest_control"]

for cat in unique_cats:
    match, score, _ = process.extractOne(cat, canonical, scorer=fuzz.token_sort_ratio)
    if score > 80:
        CATEGORY_MAP[cat] = match
```
This catches things like "plumming" (typo), "electrical issue" (extra word), and "pest contol" (misspelling) automatically.
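Once the fuzzy matches have been folded into the map, the mapping needs to be re-applied. A minimal sketch of that step, with an illustrative subset of the map and a coverage check (the sample values and map entries here are hypothetical):

```python
import pandas as pd

# Hypothetical sample; in practice this is the DataFrame cleaned above.
df = pd.DataFrame({"category_clean": ["boiler", "plumming", "electrical issue", "plumbing"]})

# Map as built by the exact + fuzzy steps above (illustrative subset).
CATEGORY_MAP = {
    "boiler": "heating_boiler",
    "plumming": "plumbing",
    "electrical issue": "electrical",
}
canonical = ["heating_boiler", "plumbing", "electrical"]

# Re-apply the expanded map; values not in the map pass through unchanged.
df["category_clean"] = df["category_clean"].map(CATEGORY_MAP).fillna(df["category_clean"])

# Sanity check: what share of records now sits in a canonical category?
coverage = df["category_clean"].isin(canonical).mean()
print(f"{coverage:.1%} of records in canonical categories")
```

Running the coverage check after each mapping pass tells you when you have hit the 90%+ consistency target discussed later.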
2. Missing fields
Some records are missing values. Sometimes 5%. Sometimes 40%. The approach depends on what's missing and why.
Types of missing data:
- Missing completely at random (MCAR): The field is blank for no particular reason. A data entry person skipped it. These are the easiest to handle.
- Missing not at random (MNAR): The field is blank because of the value itself. For example, "resolution_cost" is blank for unresolved tickets. The missingness tells you something.
- Systematically missing: An entire source or time period has blanks because the field was added later or one team doesn't fill it in.
How I handle each:
For MCAR with less than 10% missing: impute with the median (numerical) or mode (categorical). Simple and usually good enough.
```python
df["cost"] = df["cost"].fillna(df["cost"].median())
df["priority"] = df["priority"].fillna(df["priority"].mode()[0])
```
For MNAR: create a separate "missing" indicator. The fact that the field is blank is information the model can use.
```python
df["cost_missing"] = df["cost"].isna().astype(int)
df["cost"] = df["cost"].fillna(0)
```
For systematically missing: investigate the source. If half your data is from before 2024 and a key field was only added in 2025, you might need to drop the older records or accept that the model only learns from newer data. There's no trick that manufactures information that doesn't exist.
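A quick way to spot systematic missingness is to break the missing rate down by source or time period. A sketch, with a hypothetical field "assigned_team" that was only added to the system in 2025:

```python
import pandas as pd

# Hypothetical sample: "assigned_team" did not exist before 2025.
df = pd.DataFrame({
    "year": [2023, 2023, 2024, 2025, 2025, 2025],
    "assigned_team": [None, None, None, "north", "south", None],
})

# Missing rate per year: a hard 100% for whole periods signals a
# systematic gap, not random blanks.
missing_by_year = df.groupby("year")["assigned_team"].apply(lambda s: s.isna().mean())
print(missing_by_year)
```

If the rate is pinned at 100% for entire years, no imputation strategy will help; that is the "drop or accept" decision described above.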
Warning
Never fill missing values with 0 without thinking about what 0 means. If "cost" is blank because the job hasn't been billed yet, filling it with 0 tells the model "this was free." Use NaN indicators instead.
3. Format variation across sources
When data comes from multiple systems (and it always does), formats diverge. Dates are the worst offender:
- System A: "25/03/2026" (DD/MM/YYYY)
- System B: "2026-03-25" (ISO)
- System C: "March 25, 2026" (long format)
- System D: "03/25/2026" (MM/DD/YYYY, American)
Same date, four formats. Parse them wrong and March 25th becomes the 3rd of the 25th month (a loud error you'll notice). Worse, an ambiguous date like 03/12/2026 silently flips between the 3rd of December and the 12th of March (a silent, dangerous error you won't).
My approach: normalise everything to ISO format (YYYY-MM-DD) as the first step, before any analysis:
```python
from dateutil import parser

def parse_date(val):
    try:
        return parser.parse(str(val), dayfirst=True).date()
    except (ValueError, TypeError):
        return None

df["date_clean"] = df["date_raw"].apply(parse_date)

# Check how many failed
failed = df["date_clean"].isna().sum()
print(f"{failed} dates could not be parsed ({failed/len(df)*100:.1f}%)")
```
The dayfirst=True parameter resolves the UK vs US ambiguity by preferring the day-first reading, which is the right default for UK-sourced data. For the dates that fail to parse, manual inspection is usually needed. There's no automated way to know whether "01/02/2026" is January 2nd or February 1st without context.
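What you can automate is finding the dates that need that manual look. One possible check: a slash-separated date is ambiguous only when both leading numbers could be a month and they differ (this helper is a sketch, not part of any library):

```python
def is_ambiguous(val: str) -> bool:
    """True if a DD/MM/YYYY vs MM/DD/YYYY reading gives different dates."""
    parts = str(val).split("/")
    if len(parts) != 3:
        return False
    try:
        a, b = int(parts[0]), int(parts[1])
    except ValueError:
        return False
    # Ambiguous only when both values are valid months and not equal.
    return a != b and a <= 12 and b <= 12

print(is_ambiguous("01/02/2026"))  # Jan 2 or Feb 1? Needs review.
print(is_ambiguous("25/03/2026"))  # 25 can only be a day, so it's safe.
```

Filtering the raw date column through a check like this typically shrinks "manual inspection" from thousands of rows to a handful.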
The 80/20 rule of data cleaning
Here's the practical reality: you don't need perfect data. You need data that's clean enough for the model to learn useful patterns.
In my experience, the breakdown looks like this:
- First 2 hours: Basic cleaning (normalise formats, fix obvious duplicates, handle blanks). Fixes 50-60% of issues.
- Next 4 hours: Category mapping, fuzzy matching, outlier handling. Fixes another 20-25%.
- Next 8 hours: Edge cases, cross-source reconciliation, manual review of ambiguous records. Fixes another 10-15%.
- Beyond that: Diminishing returns. The remaining 5% of issues would take another 20+ hours for marginal improvement.
Most projects land in a good place after 6-8 hours of dedicated cleaning. The model accuracy difference between "8 hours of cleaning" and "40 hours of cleaning" is usually 2-3 percentage points. Not worth it unless you're in a domain where every percentage point matters.
What "clean enough" looks like
For most business AI projects, these are the targets I aim for:
| Metric | Target | Why |
|---|---|---|
| Missing values in key fields | Below 5% | Higher than this and the model is learning from too many imputed values |
| Category consistency | 90%+ of records in canonical categories | The long tail can be grouped into "other" |
| Date parsing success | 99%+ | Dates are critical for time-series features; missing dates break things |
| Duplicate detection | 95%+ identified | Perfect dedup is impossible; 95% is good enough |
| Outlier handling | Reviewed and decided | Not removed automatically; some outliers are real |
When I hit these numbers, the model can train on meaningful patterns. The remaining messiness shows up as noise, and a well-tuned model handles noise reasonably well.
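These targets are easy to measure directly. A minimal sketch of a quality report, using hypothetical column names from the examples in this post:

```python
import pandas as pd

# Hypothetical cleaned dataset (column names are illustrative).
df = pd.DataFrame({
    "category_clean": ["heating_boiler", "plumbing", "other", "heating_boiler"],
    "cost": [120.0, None, 80.0, 200.0],
    "date_clean": pd.to_datetime(["2026-03-25", "2026-03-26", None, "2026-04-01"]),
})
canonical = ["heating_boiler", "plumbing", "electrical"]

# One number per row of the targets table above.
report = {
    "missing_cost_pct": df["cost"].isna().mean() * 100,
    "canonical_category_pct": df["category_clean"].isin(canonical).mean() * 100,
    "date_parse_pct": df["date_clean"].notna().mean() * 100,
}
for metric, value in report.items():
    print(f"{metric}: {value:.1f}%")
```

Running a report like this before and after each cleaning pass shows exactly which target is still missed, so the next hours of effort go where they matter.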
Data cleaning is not a one-time job
This is the part that surprises most clients. You clean the data, train the model, deploy it, and then new data arrives that's just as messy as the original data was.
New suppliers send invoices in new formats. Staff turnover means different people entering data differently. A system update changes export formats. The data pipeline needs to clean incoming data continuously, not just the historical batch.
I build cleaning logic into the data pipeline itself. Every record that enters the system goes through the same normalisation, mapping, and validation steps. Records that fail validation get flagged for human review rather than silently imported with errors.
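A minimal sketch of that validate-or-flag step (the field names, required fields, and checks here are illustrative, not a fixed schema):

```python
# Fields every incoming record must carry; hypothetical for this sketch.
REQUIRED_FIELDS = ["category", "date", "cost"]

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, "", "N/A"):
            problems.append(f"missing {field}")
    if isinstance(record.get("cost"), (int, float)) and record["cost"] < 0:
        problems.append("negative cost")
    return problems

clean, flagged = [], []
for rec in [{"category": "plumbing", "date": "2026-03-25", "cost": 80.0},
            {"category": "", "date": "2026-03-26", "cost": -5.0}]:
    (clean if not validate(rec) else flagged).append(rec)

print(f"{len(clean)} clean, {len(flagged)} flagged for review")
```

The flagged queue goes to a human; nothing enters the training data silently broken.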
This turns data cleaning from a project into a process. Which is exactly how it should be.
Key Takeaways
- Data cleaning takes 60-80% of project time. Budget for it explicitly, not as an afterthought.
- Inconsistent categories, missing fields, and format variation are the three problems you'll always face.
- Simple rules (mapping dictionaries, format normalisation) fix 70% of issues in a few hours.
- Clean enough for 90%+ model accuracy is the goal. Perfect data is not achievable or necessary.
- Build cleaning into the pipeline so it handles new data automatically, not just the historical batch.
If you're sitting on business data that you suspect is messy and want to know whether it's usable for AI, that's a question I can answer quickly. A short data assessment tells you what's there, what's missing, and what it would take to get it to a useful state. Get in touch and send me a sample export. I'll give you a straight answer.
Related reading:
- Building a data pipeline with Python: a practical guide
- PDF parsing in 2026: what actually works
- My custom AI solutions service: bespoke models built on your data