Throughput Outliers: IQR Detection & AI Semantic Adjustment

Data Quality · Published April 2026

Three weeks ago, your team closed 22 items in a single sprint. Your average is 6. That outlier is now in your data, and unless you handle it, your forecast will be more optimistic than it should be.

Throughput outliers are a normal part of agile delivery. Bug-bash sprints, focused cleanup weeks, last-minute pushes before a deadline — they all produce data points that don't represent how the team works most of the time.

This article covers two complementary techniques for handling outliers: IQR detection, a statistical method, and AI semantic adjustment, a language-based one. Both ship in Nexus Hub Pro; both can be implemented manually with reasonable effort.

What counts as an outlier

An outlier is a data point that lies far from the rest of the distribution. What counts as "far" depends on the method:

Statistical detection finds outliers in throughput numbers. Semantic detection finds outliers in upcoming work that the throughput history may not predict well.

IQR detection — the method

The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Tukey's rule (1977) flags anything outside the range Q1 − 1.5×IQR to Q3 + 1.5×IQR as an outlier.

Worked example. Throughput over 13 weeks: 5, 6, 4, 7, 3, 8, 5, 22, 2, 9, 5, 7, 4.

  1. Sort: 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 22
  2. Q1 (25th percentile) = 4
  3. Q3 (75th percentile) = 7
  4. IQR = Q3 − Q1 = 3
  5. Upper fence = Q3 + 1.5 × IQR = 7 + 4.5 = 11.5
  6. Lower fence = Q1 − 1.5 × IQR = 4 − 4.5 = -0.5 (effectively 0)
  7. Outliers: anything > 11.5 → only the value 22 qualifies

Decision: keep, remove, or flag for review. Most tools (including Nexus Hub) default to flagging — the user decides.
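
To reproduce the arithmetic in a notebook, here is a minimal Python sketch; numpy's default linear-interpolation percentiles return the same Q1 = 4 and Q3 = 7 used above:

  import numpy as np

  # Weekly throughput from the worked example above
  throughput = [5, 6, 4, 7, 3, 8, 5, 22, 2, 9, 5, 7, 4]

  q1, q3 = np.percentile(throughput, [25, 75])   # 4.0 and 7.0
  iqr = q3 - q1                                  # 3.0
  upper_fence = q3 + 1.5 * iqr                   # 11.5
  lower_fence = max(q1 - 1.5 * iqr, 0.0)         # -0.5, floored at 0

  outliers = [t for t in throughput if t > upper_fence or t < lower_fence]
  print(outliers)                                # [22]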

What to do with an outlier

Three options:

Option A — Remove from the dataset

If you can identify a one-off cause (bug-bash sprint, cleanup week, deadline push that won't repeat), remove the outlier from the throughput sample. The forecast becomes representative of normal operations.

Risk: removing too aggressively makes your forecast over-confident. If "outlier" sprints happen quarterly, they're not outliers — they're a recurring pattern.

Option B — Keep it and accept the noise

Keep the outlier in the dataset. Run the simulation. The output will show a wider P50–P95 range (because variance is higher). Stakeholders see the wider range, and it accurately reflects the team's real variability.

This is the conservative choice. Recommended when the cause of the outlier is unclear or when it might repeat.

Option C — Investigate before deciding

Sometimes the outlier reveals something the team should know — exceptional collaboration, unblocked dependencies, a process change that's working. Before deciding to remove or keep, ask:

  1. Was the cause something the team controls, and is it likely to recur?
  2. Was it a one-off event that is unlikely to repeat?

If yes to "controllable + recurring" — leave it in, the team has improved. If yes to "one-off + unlikely" — remove it.

The other kind of outlier — semantic risk

IQR detection finds anomalies in past throughput. It can't tell you which upcoming stories are going to be harder than the team's history suggests.

That's where semantic risk markers come in. Some keywords in work-item descriptions reliably signal "this story will take longer than its size implies":

No single marker guarantees that a story will slip. But across hundreds of stories in our pilot data, items carrying these markers took 1.4–2.2× the team's median cycle time to complete.

AI semantic adjustment — the method

Reading every story description manually doesn't scale past ~50 stories. Automated NLP scanning does:

  1. Tokenize the work-item title + description
  2. Match against a risk-marker dictionary (the list above + domain-specific extensions)
  3. Compute a complexity factor — the multiplier to apply to forecast variance
  4. Apply the factor to upcoming items in the simulation

The output is a Monte Carlo forecast where high-risk items contribute more variance to the simulation than low-risk items — even when their story-point sizes are identical.
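
For illustration, steps 1–3 might look something like the Python below. The marker dictionary, its weights, and the take-the-strongest-marker rule are invented for this sketch; they are not Nexus Hub's actual dictionary or scoring:

  import re

  # Hypothetical risk-marker dictionary: keyword -> variance multiplier
  RISK_MARKERS = {
      "migration": 1.6, "legacy": 1.5, "integration": 1.4,
      "unknown": 1.5, "third-party": 1.4, "spike": 1.3,
  }

  def complexity_factor(title: str, description: str) -> float:
      """Return a variance multiplier for one work item (1.0 = no detected risk)."""
      tokens = re.findall(r"[a-z0-9-]+", f"{title} {description}".lower())
      hits = [RISK_MARKERS[t] for t in tokens if t in RISK_MARKERS]
      # Use the strongest single marker rather than stacking them, so a
      # keyword-heavy description doesn't blow up the forecast on its own.
      return max(hits, default=1.0)

  print(complexity_factor("Billing cutover", "One-shot migration off the legacy schema"))  # 1.6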

In Nexus Hub Pro, this runs server-side as part of the Predictive Analytics module. Items with risk markers get a small badge in the per-item table; the team can override the AI assessment if a specific item is misclassified.

Worked example — the same backlog, with and without semantic adjustment

Suppose your team has 50 upcoming stories and a P85 forecast of 11 weeks. After AI semantic adjustment scans the descriptions, 20 of the 50 stories are flagged with at least one risk marker.

The complexity factor for those 20 risky items inflates simulation variance. New P85: 13 weeks (up from 11).

Two weeks. Possibly the difference between a stakeholder-trusted commit and a missed date.
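
To see why the range widens, here is a toy throughput-sampling Monte Carlo in Python where the flagged stories are weighted by a flat complexity factor. The factor of 1.5, the run count, and the idea of applying the weight to effective backlog size (a simplification of the per-item variance inflation described above) are all invented for this sketch:

  import numpy as np

  rng = np.random.default_rng(7)
  history = [5, 6, 4, 7, 3, 8, 5, 2, 9, 5, 7, 4]   # weekly throughput, outlier removed
  n_items, n_risky, factor = 50, 20, 1.5           # illustrative values

  def p85_weeks(effective_backlog, runs=10_000):
      weeks = []
      for _ in range(runs):
          done, w = 0.0, 0
          while done < effective_backlog:
              done += rng.choice(history)          # sample one week of throughput
              w += 1
          weeks.append(w)
      return np.percentile(weeks, 85)

  plain = p85_weeks(n_items)                                  # every item weighted 1.0
  adjusted = p85_weeks(n_items - n_risky + n_risky * factor)  # flagged items weighted up
  print(plain, adjusted)                                      # the adjusted P85 lands later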

Limits of automated detection

Both techniques have limits:

IQR limits

The fences look only at past throughput numbers: they can tell you that a sprint was unusual, not why, and with only a dozen or so weekly data points the quartile estimates themselves are noisy.

Semantic adjustment limits

Keyword matching flags some items that turn out to be routine and misses risk that was never written down in the description; that's why the per-item override exists.

Both are tools, not authorities. The team's judgment is the final arbiter.

Implementation recipe

Manual approach (works for one team, episodically):

  1. Pull last 13 weeks of throughput from Azure DevOps Analytics
  2. Compute Q1, Q3, IQR in Excel or a notebook
  3. Flag rows above Q3 + 1.5×IQR
  4. For each flagged row, identify the cause and decide keep/remove
  5. For upcoming stories, scan descriptions for the risk markers above; flag manually
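
If the weekly throughput lands in a CSV, steps 2–4 fit in a few lines of pandas; the file name and the 'completed' column are assumptions about your export, not the Analytics schema:

  import pandas as pd

  # Assumed layout: one row per week with a 'completed' count (hypothetical export)
  df = pd.read_csv("throughput.csv")
  q1, q3 = df["completed"].quantile([0.25, 0.75])

  upper_fence = q3 + 1.5 * (q3 - q1)
  flagged = df[df["completed"] > upper_fence]
  print(flagged)    # review each flagged week, identify the cause, decide keep/remove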

Tooled approach (works at scale, ongoing):

Install Nexus Hub Pro — both IQR detection and AI semantic adjustment ship out of the box. Toggle outlier removal on per-simulation; AI adjustment runs automatically with team override per item.

Run forecasts that handle outliers automatically

Nexus Hub Pro detects outliers, flags risk markers, and produces calibrated forecasts on real Azure DevOps throughput. 14-day free Pro trial.

Install from Marketplace →