A foundation model flags equipment anomalies in heavy-asset industry. I designed the human review system between its probabilistic output and the engineers who act on it - from triage through investigation to decision.

The Challenge
The model produces confidence scores. Engineers make maintenance decisions. A missed anomaly on a high-pressure separator can escalate to shutdown. A false positive erodes trust in a model operators have never worked with before. The question was not how to display anomalies - it was how to structure human judgment around uncertain model output in a high-consequence environment.

What I Did
The workflow has two distinct modes: triage and investigation. Triage is about speed and signal. Investigation is about depth and context. I designed for both as a connected sequence.
Triage
Engineers receive a continuous stream of flagged anomalies. Most don't warrant investigation. The design problem was giving engineers enough signal to make that call in seconds - without pulling them into the detail of every alert.
Each anomaly surfaces a preview: severity, confidence score, affected equipment, and a brief AI-generated description. The preview is a decision artifact, not a summary. It answers one question: is this worth my time?
Threshold governance sits upstream. The model outputs a continuous confidence score (0–1); engineers act on categories. I designed the confidence-to-severity mapping as an adjustable, simulatable policy surface - move a threshold, see exactly how many anomalies are reclassified. A minimum confidence floor controls what reaches the inbox at all. Alert volume becomes a deliberate policy choice.
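The mechanics can be sketched in a few lines. This is a minimal illustration, not the production implementation: the threshold values, category names, and function names (`classify`, `simulate`, `DEFAULT_THRESHOLDS`, `CONFIDENCE_FLOOR`) are all hypothetical stand-ins for the adjustable policy surface described above.

```python
# Hypothetical policy surface: ordered cutoffs map a 0-1 confidence
# score to a severity category. All values here are illustrative.
DEFAULT_THRESHOLDS = {"low": 0.55, "medium": 0.75, "high": 0.90}
CONFIDENCE_FLOOR = 0.40  # scores below this never reach the inbox

def classify(score, thresholds=DEFAULT_THRESHOLDS, floor=CONFIDENCE_FLOOR):
    """Map a raw confidence score to a severity label, or None if filtered."""
    if score < floor:
        return None
    label = "info"
    # Walk cutoffs in ascending order; the highest cutoff the score
    # clears determines its severity category.
    for severity, cutoff in sorted(thresholds.items(), key=lambda kv: kv[1]):
        if score >= cutoff:
            label = severity
    return label

def simulate(scores, old, new, floor=CONFIDENCE_FLOOR):
    """Count how many anomalies would be reclassified under a proposed policy."""
    return sum(
        1 for s in scores
        if classify(s, old, floor) != classify(s, new, floor)
    )
```

The point of `simulate` is the governance loop: an engineer proposes moving a threshold and sees the reclassification count before committing, which is what turns alert volume into a deliberate policy choice rather than a side effect.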
When an engineer decides an anomaly doesn't warrant investigation, dismissal is structured. They select from a defined taxonomy - normal operation, sensor issue, scheduled event, detection too sensitive - and rate their own confidence in the dismissal. The taxonomy feeds model retraining. The self-rating introduces a deliberate pause: engineers who rate themselves "Uncertain" while dismissing a high-severity anomaly sometimes reverse the decision. The friction is the feature.
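A dismissal record built this way might look like the following sketch. The enum values mirror the taxonomy named above; the class names, field names, and the `needs_confirmation` check are hypothetical illustrations of the structure, assuming severity arrives as a simple label.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class DismissalReason(Enum):
    NORMAL_OPERATION = "normal_operation"
    SENSOR_ISSUE = "sensor_issue"
    SCHEDULED_EVENT = "scheduled_event"
    TOO_SENSITIVE = "detection_too_sensitive"

class SelfConfidence(Enum):
    CERTAIN = "certain"
    LIKELY = "likely"
    UNCERTAIN = "uncertain"

@dataclass
class Dismissal:
    anomaly_id: str
    reason: DismissalReason            # feeds model retraining
    self_confidence: SelfConfidence    # the deliberate pause
    dismissed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def needs_confirmation(self, severity: str) -> bool:
        # Deliberate friction: an "Uncertain" dismissal of a
        # high-severity anomaly prompts the engineer to confirm
        # before the dismissal commits.
        return (
            severity == "high"
            and self.self_confidence is SelfConfidence.UNCERTAIN
        )
```

Structuring the dismissal this way serves both audiences at once: the `reason` field is clean training signal for the model, while `needs_confirmation` encodes the reflective pause for the human.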
Investigation
When an anomaly warrants attention, the engineer moves into a four-zone workspace designed around a single question: what is happening, and what should I do about it?
The left panel provides process context - where the affected equipment sits within the broader system, its relationships to upstream and downstream processes, and relevant operational metadata. This orients the engineer before they touch any data.
The center area is the primary visualization space - time series data with the anomaly highlighted in context. Engineers overlay related signals, zoom across time windows, and compare against historical baselines. This is the primary analytical surface.
The right panel houses investigation tools, including custom AI agents. Engineers pin specific data - a time series, a work order, a correlation chart - to an agent's scope before asking a question. They see what the agent sees. When a response is poor, they adjust the scope rather than losing trust in the system.
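The scoping mechanic can be sketched as a small data structure. This is an assumption-laden illustration of the interaction model, not the actual agent integration: `Artifact`, `AgentScope`, and the string-based `context` serialization are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    kind: str      # e.g. "time_series", "work_order", "correlation_chart"
    ref: str       # identifier for the underlying data
    summary: str   # short human-readable description

@dataclass
class AgentScope:
    pinned: list = field(default_factory=list)

    def pin(self, artifact: Artifact):
        self.pinned.append(artifact)

    def unpin(self, ref: str):
        self.pinned = [a for a in self.pinned if a.ref != ref]

    def context(self) -> str:
        # The agent sees exactly what the engineer pinned - nothing
        # more. Adjusting a poor response means adjusting this list,
        # not abandoning the tool.
        return "\n".join(f"[{a.kind}] {a.ref}: {a.summary}" for a in self.pinned)
```

The design value is in the symmetry: because `context` is built only from what the engineer pinned, a weak answer reads as a scoping problem the engineer can fix, not a black-box failure.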
The bottom drawer contains the investigation record. Descriptions are pre-populated from structured data and remain editable throughout; they enter the final record as the engineer's own words, not as a separate AI artifact. The drawer pattern keeps the record accessible without competing for the visualization space that drives the actual analysis.

How I Worked
Eight-week engagement from kickoff to presentation. I ran discovery with domain engineers to map existing decision workflows, designed the system architecture, built and evaluated a functional prototype, and worked directly with data science on the feedback loop between human judgments and model calibration.

Key Observations
Alert fatigue is a threshold governance problem. The higher-leverage intervention is not improving the inbox - it is making the confidence floor visible, adjustable, and simulatable, so alert volume is a deliberate policy choice.
Asking humans to rate their own confidence changes their behavior. The self-rating is not primarily data collection. It is an intervention that causes reflection before commitment.
AI text should be pre-populated, not presented. When AI output is shown as a suggestion, the human evaluates the AI. When it is pre-populated in an editable field, the human owns it. The second framing produces better outcomes.
Foundation model deployment is a human systems problem. Detection performance is necessary but not sufficient. Value is realized when an engineer makes a correct decision faster. That depends on how uncertainty is communicated, how thresholds are governed, and where the boundaries of AI authority are drawn.