The Audit That Changed How I Think About This
A financial services client asked us to audit a credit card default prediction model that had been running in production for eighteen months. On paper it was a success: 87% accuracy, stable drift metrics, clean CI/CD. The business was happy.
We ran a standard fairness audit as part of the engagement. What we found was uncomfortable. When we sliced the model's predictions by postcode (a feature the team had assumed was purely geographic), the false positive rate (flagging a borrower as likely to default when they wouldn't) was significantly higher in lower-income urban areas. The model was technically accurate in aggregate but systematically wrong in one direction for a specific demographic group.
Postcode wasn't a protected attribute. It was a proxy for one.
The model hadn't been built with discriminatory intent. It had been built with insufficient awareness of how historical data encodes societal inequality and then deployed without any mechanism to detect or correct for it.
This is the pattern I see repeatedly across industries. Not malice. Not carelessness. Just a gap between what "responsible AI" means on a slide deck and what it requires in practice. This post is about closing that gap.
Why the Checklist Approach Fails
Every major technology company has published a set of AI principles. Google has seven. Microsoft has six. Meta, Amazon, and IBM all have their own versions. They tend to look like this:
- Be fair and inclusive
- Be transparent and explainable
- Protect privacy
- Be accountable
- Be safe and secure
These are not wrong. They are just not actionable at the level of a sprint ticket. The problem is that the industry has treated RAI as a governance problem: something you solve with a policy document, a review board, and a checkbox in the pre-deployment checklist. Governance matters, but it cannot substitute for engineering.
When a team ticks "Fairness Reviewed" before shipping a model, what did they actually do? In most organisations I've worked with, the honest answer is: they eyeballed the overall accuracy, confirmed there was no obviously sensitive column in the feature set, and moved on. That process would not have caught the postcode problem above. It does not catch proxy discrimination, intersectional bias, or feedback loop effects: the three most common sources of harm in real production systems.
The Fairness Definition Problem
Before you can measure fairness, you need to define it. This sounds straightforward until you realise that the most common fairness definitions are mathematically incompatible with each other. You cannot simultaneously satisfy all of them on a realistic dataset.
Here are the four definitions you'll encounter most often:
| Definition | What It Requires | Example |
|---|---|---|
| Demographic Parity | Equal positive prediction rate across groups | Approve loans at the same rate for Group A and Group B |
| Equalized Odds | Equal true positive rate AND equal false positive rate across groups | Same sensitivity and specificity for both groups |
| Equal Opportunity | Equal true positive rate across groups (relaxes false positive constraint) | Qualified borrowers from both groups approved at same rate |
| Calibration | Predicted probability equals actual outcome rate across groups | "60% default probability" means 60% actual default for both groups |
The impossibility results of Chouldechova (2017) and Kleinberg et al. (2016) formally proved that if the base rates of the outcome differ between groups (which they almost always do in real datasets, precisely because of historical inequality), you cannot achieve demographic parity, equalized odds, and calibration simultaneously.
This is not a bug in the math. It is a signal that fairness is a values question as much as a technical one. The choice of which fairness definition to optimise for should be made explicitly, by domain experts and stakeholders, before model training begins, not left implicit in the loss function.
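To make the definitions in the table concrete, here is a minimal numpy sketch. The data and the helper function are my own illustration, not from any fairness library: it computes the per-group quantities that the first three definitions compare, on a toy example deliberately constructed so that demographic parity holds while equalized odds is violated, which is the incompatibility in miniature.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Selection rate, TPR, and FPR for each value of the sensitive attribute."""
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        out[g] = {
            "selection_rate": yp.mean(),                               # demographic parity compares this
            "tpr": yp[yt == 1].mean() if (yt == 1).any() else np.nan,  # equal opportunity compares this
            "fpr": yp[yt == 0].mean() if (yt == 0).any() else np.nan,  # equalized odds adds this
        }
    return out

# Toy labels and predictions for two groups (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

rates = group_rates(y_true, y_pred, group)
dp_diff = abs(rates["A"]["selection_rate"] - rates["B"]["selection_rate"])
# dp_diff is 0 (demographic parity satisfied), yet the FPRs differ
# (0.0 for A vs 1/3 for B), so equalized odds is violated
```

Both groups are selected at the same 50% rate, so a demographic parity audit passes; the false positive rates still differ, so an equalized odds audit fails on the very same predictions.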
Five Pipeline Stages Where Bias Enters
Bias is not something that appears at the end of the pipeline when you run a fairness audit. It accumulates across every stage. Understanding where it enters helps you intervene at the right point.
1. Data Collection
Historical data encodes historical decisions. If a bank historically denied more loans to applicants from certain postcodes, a model trained on that data will learn that postcodes predict default: not because the geography causes default, but because the geography correlates with who was denied credit in the first place. This is called historical bias or societal bias in the training data.
Underrepresentation is the second data collection problem. If a clinical dataset draws 90% of its data from one demographic group, a model trained on it will perform worse for everyone else. This is endemic in healthcare AI, and it is a direct patient-safety issue.
2. Feature Selection
Proxy features are features that are not sensitive attributes themselves but are highly correlated with them. Postcode correlates with race and income. Name correlates with ethnicity and gender. Device type correlates with age and income. Removing the obvious sensitive columns (race, gender, religion) does not remove the proxies.
This is why naive feature selection ("we removed gender, we're fine") does not work. You need to actively audit feature-to-sensitive-attribute correlation before training.
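A correlation audit of this kind can be sketched in a few lines of pandas. The helper name, the synthetic data, and the 0.4 flagging threshold are all illustrative choices of mine; in a real engagement the threshold should come from the fairness targets agreed at kickoff.

```python
import numpy as np
import pandas as pd

def proxy_audit(X: pd.DataFrame, sensitive: pd.Series, threshold: float = 0.4) -> pd.DataFrame:
    """Correlate each candidate feature with the sensitive attribute and flag strong proxies."""
    # Encode a categorical sensitive attribute as integer codes for correlation
    sens = sensitive.astype("category").cat.codes if sensitive.dtype == object else sensitive
    rows = []
    for col in X.columns:
        r = np.corrcoef(X[col], sens)[0, 1]
        rows.append({"feature": col, "corr_with_sensitive": r, "flagged": abs(r) >= threshold})
    return pd.DataFrame(rows).sort_values("corr_with_sensitive", key=abs, ascending=False)

# Synthetic data where 'postcode_cluster' is a proxy for income by construction
rng = np.random.default_rng(0)
income = rng.integers(0, 2, 500)                           # 0 = low, 1 = high (sensitive attribute)
X = pd.DataFrame({
    "postcode_cluster": income + rng.normal(0, 0.5, 500),  # strongly correlated with income
    "utilisation": rng.normal(0, 1, 500),                  # independent of income
})
report = proxy_audit(X, pd.Series(income))
# 'postcode_cluster' is flagged; 'utilisation' is not
```

The output is a ranked table you can attach to the feature-engineering review: flagged features get a documented include/exclude decision rather than a silent pass.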
3. Model Architecture and Training
The loss function you optimise implicitly encodes values. Standard cross-entropy loss optimises for average accuracy, which means the model may sacrifice performance on minority groups to improve aggregate numbers. Gradient boosting and deep networks are particularly prone to this because they are expressive enough to learn complex proxy relationships that simpler models would miss.
4. Decision Threshold Setting
Most classification models output a probability. You convert that to a binary decision by setting a threshold. The threshold is almost always set once, globally, based on optimising a metric like F1. But a single threshold applied uniformly can produce dramatically different false positive and false negative rates across demographic groups, even when the underlying model probabilities are well-calibrated.
Per-group threshold calibration is a simple and highly effective mitigation, but it is rarely implemented because it requires you to have explicitly decided which group-level fairness metric you care about.
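One way to implement per-group calibration is to pick, for each group, the score cutoff that hits a target false positive rate. This is a minimal sketch with synthetic scores; the helper names and the equal-FPR target are my illustrative choices, and in practice the target must match whichever fairness definition the team committed to.

```python
import numpy as np

def per_group_thresholds(scores, y_true, group, target_fpr=0.10):
    """One threshold per group, chosen so each group's false positive rate ≈ target_fpr."""
    thresholds = {}
    for g in np.unique(group):
        neg_scores = scores[(group == g) & (y_true == 0)]
        # The (1 - target_fpr) quantile of negative-class scores is the cutoff
        thresholds[g] = np.quantile(neg_scores, 1 - target_fpr)
    return thresholds

def predict_with_group_thresholds(scores, group, thresholds):
    return np.array([s >= thresholds[g] for s, g in zip(scores, group)])

# Synthetic scores where group B's distribution is shifted upward, so a single
# global threshold would inflate B's false positive rate
rng = np.random.default_rng(1)
n = 2000
group = np.where(rng.random(n) < 0.5, "A", "B")
y_true = (rng.random(n) < 0.3).astype(int)
shift = np.where(group == "B", 0.15, 0.0)
scores = np.clip(0.3 * y_true + rng.normal(0.3, 0.15, n) + shift, 0, 1)

th = per_group_thresholds(scores, y_true, group, target_fpr=0.10)
y_hat = predict_with_group_thresholds(scores, group, th)
# Both groups now sit at roughly 10% FPR, with th["B"] > th["A"]
```

Note that this is a post-processing step: the underlying model is untouched, which makes it cheap to apply and easy to monitor, but it requires the sensitive attribute to be available at decision time.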
5. Deployment and Feedback Loops
A model that produces biased decisions shapes the data it will be retrained on. If a hiring model deprioritises candidates from certain universities, fewer such candidates will be interviewed, fewer will receive offers, fewer will appear as "successful hires" in the training data, and the next model version will deprioritise them even more strongly. This is called a feedback loop or performative prediction, and it is the mechanism by which small initial biases compound into large systemic ones over time.
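The compounding dynamic can be shown with a toy "rich-get-richer" simulation, entirely my own construction: each retraining round weights a group by the square of its share of successes in the previous round's data. The specific update rule and numbers are illustrative, not a model of any real system.

```python
def simulate_feedback(share_a=0.52, rounds=5):
    """Toy feedback loop: group A's share of 'successful' training examples
    grows each round because the previous model over-selected it."""
    history = [share_a]
    for _ in range(rounds):
        share_b = 1 - share_a
        # Rich-get-richer update: next round's share is proportional to the
        # square of this round's representation in the training data
        share_a = share_a**2 / (share_a**2 + share_b**2)
        history.append(share_a)
    return history

history = simulate_feedback(0.52, rounds=5)
# A 52/48 split drifts past 90/10 within five retraining cycles
```

The point of the toy is the shape of the curve, not the numbers: under any self-reinforcing selection rule, a gap too small to trip an initial fairness audit can dominate the training distribution after a handful of retraining cycles, which is why the post-deployment checklist below includes an explicit feedback-loop audit.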
Tools That Actually Work
Three tools form the core of a practical RAI engineering stack.
Microsoft Fairlearn
Fairlearn is the most practically useful open-source library for fairness-aware ML. It provides two things: a dashboard for visualising performance across demographic groups, and mitigation algorithms that constrain the optimisation during training.
```python
from fairlearn.metrics import MetricFrame, demographic_parity_difference, false_positive_rate
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes X_train/X_test, y_train/y_test, and y_pred from your existing pipeline

# --- 1. Measure fairness BEFORE mitigation ---
mf = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "false_positive_rate": false_positive_rate,
    },
    y_true=y_test,
    y_pred=y_pred,
    sensitive_features=X_test["income_bracket"],
)
print(mf.by_group)  # one row of metrics per income bracket
print(
    "Demographic parity difference:",
    demographic_parity_difference(y_test, y_pred, sensitive_features=X_test["income_bracket"]),
)

# --- 2. Apply mitigation: constrain training toward demographic parity ---
base_estimator = LogisticRegression(max_iter=1000)
mitigator = ExponentiatedGradient(base_estimator, DemographicParity())
mitigator.fit(X_train, y_train, sensitive_features=X_train["income_bracket"])
y_pred_fair = mitigator.predict(X_test)
```
The MetricFrame object is what makes Fairlearn genuinely useful: it disaggregates any metric you care about by any sensitive feature. Running this before deployment takes ten minutes and will surface group-level disparities that aggregate metrics hide completely.
SHAP (SHapley Additive Explanations)
SHAP gives you feature-level explanations for individual predictions, grounded in game theory. It tells you not just which features the model uses overall, but which features drove a specific decision for a specific person. This is what "explainability" actually means in practice: not a global feature importance chart, but the ability to say "this application was declined primarily because of X, Y, and Z."
```python
import shap

explainer = shap.TreeExplainer(model)  # or shap.Explainer(model) for model-agnostic use
shap_values = explainer.shap_values(X_test)

# Global summary: which features matter most overall
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Individual explanation: why did this specific prediction happen?
shap.force_plot(
    explainer.expected_value,
    shap_values[sample_idx],
    X_test.iloc[sample_idx],
)

# Detect proxy discrimination: plot SHAP values coloured by the sensitive attribute
shap.dependence_plot("postcode_cluster", shap_values, X_test, interaction_index="income_bracket")
```
The dependence plot is particularly useful for proxy detection. If the SHAP value for postcode_cluster shows a strong pattern when coloured by income_bracket, you have a proxy feature operating in the model, even if the overall feature importance looks benign.
IBM AI Fairness 360 (AIF360)
AIF360 provides a broader set of bias metrics (30+) and pre-, in-, and post-processing mitigation algorithms (10+). It is more comprehensive than Fairlearn and better suited for research-grade fairness analysis or regulatory reporting where you need to demonstrate that you evaluated multiple fairness criteria.
Explainability Is Not Optional Anymore
The conversation about explainability has shifted from "nice to have" to "legally required" in the span of three years. The EU AI Act (Article 13), India's DPDP Act, and the US FTC's algorithmic accountability guidance all now establish, with varying degrees of specificity, that individuals have a right to a meaningful explanation of automated decisions that affect them.
The technical challenge is that global explainability (the model uses these features in these proportions) is not the same as local explainability (this decision was made for this reason). Courts and regulators care about the latter. A feature importance chart from your training run is not sufficient.
For LLM-based systems, explainability takes a different form. The analogous tools are attention visualisation, chain-of-thought prompting with logged reasoning traces, and output attribution systems like LangFuse traces that can tell you which retrieved documents or input segments most influenced a specific output. The underlying principle is the same: you need to be able to answer "why did the system produce this output for this user" not just "how does the system work in general."
The EU AI Act: What It Means for Practitioners in 2026
The EU AI Act entered full enforcement in August 2026. If you are building AI systems that process data about EU residents (regardless of where you or your company are based), the risk classification framework applies to you.
The Act defines four risk tiers. The tier that matters most for typical enterprise AI work is High Risk, which includes AI systems used in:
- Credit scoring and creditworthiness assessment
- Recruitment, CV screening, and interview scoring
- Healthcare triage, diagnosis, and treatment recommendation
- Education: student assessment, admission, and examination
- Law enforcement: crime prediction, evidence evaluation
- Border control and biometric identification
For high-risk systems, the Act mandates:
- Conformity assessment: a documented evaluation before deployment that includes bias testing across protected groups
- Human oversight mechanisms: the system must be designed to allow human intervention; fully automated high-stakes decisions are prohibited in most categories
- Technical documentation: training data provenance, model architecture, performance metrics disaggregated by demographic group, and known limitations
- Ongoing monitoring: post-deployment surveillance for bias drift, accuracy degradation, and distributional shift
- Logging: audit logs sufficient to reconstruct what the system did and why, retained for a minimum period
The India Context: DPDP Act and High-Stakes AI
India's Digital Personal Data Protection Act (2023) is now operationally active and directly intersects with AI systems that process personal data. While the DPDP Act is primarily a data governance framework rather than an AI-specific regulation, several of its provisions have direct implications for ML systems.
The right to grievance redressal (Section 13) means that individuals can challenge automated decisions that use their personal data. The obligation to process only with a lawful basis and to limit data use to the stated purpose constrains what features an ML model can legally use. Data principals (individuals) can withdraw consent, which creates engineering requirements around model unlearning or retraining workflows that most teams have not yet built.
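The simplest building block of such a workflow is an exclusion step that runs before every retraining job. The sketch below is a deliberately minimal stand-in, with hypothetical names like principal_id: it guarantees withdrawn data is excluded from the next model version, but true machine unlearning (removing a person's influence from already-trained weights) is a separate and much harder problem.

```python
import pandas as pd

def build_training_set(records: pd.DataFrame, withdrawn_ids: set) -> pd.DataFrame:
    """Drop every record belonging to a data principal who has withdrawn consent,
    so the next retraining run never sees their data."""
    return records[~records["principal_id"].isin(withdrawn_ids)].copy()

# Hypothetical record store: p2 has withdrawn consent
records = pd.DataFrame({
    "principal_id": ["p1", "p2", "p2", "p3"],
    "feature":      [0.2, 0.5, 0.7, 0.9],
    "label":        [0, 1, 1, 0],
})
training = build_training_set(records, withdrawn_ids={"p2"})
# training now contains only p1 and p3 records
```

In production this filter belongs inside the retraining pipeline itself, fed from the consent registry, so that exclusion cannot be skipped by a manual step.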
The DPDP Act does not yet have the explicit risk-tiering of the EU AI Act, but enforcement guidance is evolving. For companies building AI for Indian financial services, healthcare, or government contexts, the responsible approach is to apply EU-AI-Act-style documentation and bias testing proactively: not because the letter of Indian law currently requires it, but because the regulatory direction is clear and the cost of retrofitting compliance into a deployed system is much higher than building it in from the start.
Embedding RAI in the Development Cycle
The most important shift in how I approach this now is treating RAI as a continuous engineering discipline rather than a pre-deployment gate. Here is what that looks like in practice:
At project kickoff
- Define the sensitive attributes relevant to the problem (protected characteristics, proxy-correlated features)
- Choose the fairness definition to optimise for, documented in the requirements
- Set measurable fairness targets alongside accuracy targets (e.g., false positive rate disparity < 5% across income groups)
- Identify the regulatory tier the system falls into
During feature engineering
- Run correlation analysis between candidate features and sensitive attributes
- Flag and review high-correlation proxy features explicitly: don't automatically drop them, but make a documented decision about whether to include them
- Audit data sources for underrepresentation of demographic groups
During model training
- Run MetricFrame disaggregation after every evaluation run, not just before deployment
- Apply mitigation techniques (reweighing, adversarial debiasing, or post-hoc threshold calibration) when fairness targets are breached
- Generate SHAP explanations on the validation set and review dependence plots for proxy patterns
At deployment
- Set up fairness monitoring alongside accuracy monitoring: group-level false positive rates, approval rate disparities, and output distribution by demographic segment
- Define a fairness drift threshold that triggers model review (not just accuracy drift)
- Implement an explanation endpoint: every prediction that affects a user should be explainable on request
Post-deployment
- Audit feedback loops: is the model's output creating data that it will be retrained on? If yes, how does this affect minority groups over multiple retraining cycles?
- Run quarterly fairness reviews on production data, not just the original test set
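The deployment-stage fairness monitoring described above can be sketched as a small check that runs on each production batch. The helper name, the toy batch, and the wiring are illustrative; the 5-percentage-point gap threshold echoes the target suggested at kickoff, but the right value is whatever your documented fairness targets say.

```python
import numpy as np

def fairness_drift_check(y_true, y_pred, group, max_fpr_gap=0.05):
    """Per-group false positive rates on a production batch; flag the model
    for human review when the largest between-group gap exceeds the threshold."""
    fprs = {}
    for g in np.unique(group):
        neg = (group == g) & (y_true == 0)
        fprs[str(g)] = float(y_pred[neg].mean())
    gap = max(fprs.values()) - min(fprs.values())
    return {"fprs": fprs, "gap": gap, "needs_review": gap > max_fpr_gap}

# Toy production batch: group B's FPR (0.4) is double group A's (0.2)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
result = fairness_drift_check(y_true, y_pred, group)
# result["needs_review"] is True: the 0.2 gap exceeds the 0.05 threshold
```

In a real pipeline this runs on labelled outcomes as they arrive (defaults observed, hires confirmed) and the `needs_review` flag feeds the same alerting channel as accuracy drift.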
A Real Example: Credit Card Default Prediction
The credit card default project I referenced at the opening was the RAI engagement that most shaped how I think about this. Here is what the mitigation process actually looked like once we identified the postcode-proxy problem.
Step 1: Quantify the disparity. Using Fairlearn's MetricFrame, we calculated false positive rates by income bracket (derived from postcode clustering). The disparity was 11 percentage points between the highest and lowest income brackets: the model flagged low-income applicants as default risks at a rate 11 percentage points higher, even when their actual default behaviour was equivalent.
Step 2: Identify the mechanism. SHAP dependence plots revealed that postcode_cluster was acting as a strong positive contributor to default risk probability specifically in clusters corresponding to lower-income areas, even after controlling for actual credit utilisation and repayment history. The model had learned the proxy from historical approval data in which low-income postcodes had been underserved.
Step 3: Apply mitigation. We tested two approaches: (a) removing postcode features entirely and retraining, and (b) applying Fairlearn's ExponentiatedGradient with a demographic parity constraint. Option (a) reduced accuracy by 4.2%. Option (b) reduced the fairness disparity from 11 to 2.8 percentage points with an accuracy cost of 1.7%. The client chose option (b), a sensible tradeoff.
Step 4: Set up ongoing monitoring. We added group-level false positive rate tracking to the existing MLflow monitoring dashboard. The fairness drift alert threshold was set at a 5 percentage point disparity; above that, the model is automatically flagged for human review before the next retraining cycle.
The entire mitigation process, from audit to production update, took six working days. The cost of not doing it (regulatory exposure under RBI's fairness guidelines for BFSI AI, reputational risk, and the actual harm to real borrowers) was orders of magnitude higher.
FAQ
What is the difference between fairness and accuracy in machine learning?
Accuracy measures overall predictive correctness across all samples. Fairness measures whether that correctness is distributed equitably across demographic groups. A model can be highly accurate in aggregate while systematically underperforming for a specific group; the tension between the two objectives is commonly called the accuracy-fairness tradeoff. In practice, optimising purely for accuracy often amplifies existing societal inequalities in the training data.
What tools are used for bias detection in machine learning?
The main open-source tools are: Microsoft Fairlearn (fairness metrics and mitigation algorithms), IBM AI Fairness 360 (30+ bias metrics and 10+ mitigation algorithms), SHAP (model explainability), LIME (local interpretable explanations), and Google's What-If Tool. For LLM systems, Guardrails AI and LangKit add bias and toxicity detection layers to inference pipelines.
What does the EU AI Act require for high-risk AI systems?
High-risk systems must have a conformity assessment before deployment, human oversight mechanisms, detailed documentation of training data and model decisions, ongoing monitoring for bias and drift, and the ability to explain decisions to affected individuals on request. Full enforcement began in August 2026.
Can you have a fair model without sacrificing too much accuracy?
Usually yes. Techniques like reweighing training samples, adversarial debiasing, and per-group threshold calibration typically reduce accuracy by 1–5% while significantly improving fairness metrics. The key insight is that fairness-accuracy tradeoffs are much smaller than often assumed, and the business and legal risk of deploying an unfair model far outweighs a small accuracy reduction.
If you're building high-stakes AI systems and want help structuring a responsible AI review process (covering bias audits, explainability, and regulatory compliance), that's exactly the kind of engagement I take on. See how I work with enterprise teams →
The credit card default project mentioned in this post is part of my Responsible AI portfolio work; you'll find more technical detail on the fairness mitigation pipeline there.