
The Wrong Question, Asked at Scale


A landmark study of more than four million job applications shows how AI hiring tools hide their bias, why one rejection can become rejection everywhere, and why independent, position-level assessment is no longer optional.

A new study just did something the AI hiring industry has spent years insisting was unnecessary. It looked.

Researchers at Stanford, Chapman, and Northeastern analysed more than 4 million job applications from roughly 3 million applicants across 156 employers, most of them companies with five billion dollars or more in annual revenue, all screened by a single vendor. The paper, “Algorithmic Monocultures in Hiring,” goes to the ACM Conference on Fairness, Accountability, and Transparency in Montreal next month. Its first sentence of findings is blunt: the authors report “clear racial disparities” in who the algorithm recommends.

The headline number is the one that travels: more than 25% of all applications submitted by Black applicants, close to 40,000 submissions, went to positions where the tool produced outcomes that federal guidelines define as discriminatory. Asian applicants were affected at a comparable scale, with nearly 15% of their applications landing in the same category. That is the part everyone will share. The part that matters more for anyone running or buying these systems is how the disparity stayed hidden for so long.

They did not break the math. They changed the question.

The vendor in this study, the games-based assessment platform Pymetrics, had run its own fairness analysis and found nothing that reached the threshold of legal concern. The researchers did not dispute that math. They disputed the question it answered.

Pymetrics pooled all applicants and all outcomes together, across every employer and every role, then checked the aggregate for disparity. The Stanford-led team instead did what U.S. discrimination law actually asks for: they tested each of the 1,746 individual positions on its own, against the Equal Employment Opportunity Commission’s four-fifths rule, under which a group whose selection rate falls below 80% of the highest group’s rate is treated as showing adverse impact. Measured that way, 10.62% of positions showed adverse impact against Black applicants, and 30% of Black applicants had applied to at least one of them.
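The difference between the two questions can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' or the vendor's code: the three positions, their counts, and the names `Position` and `impact_ratio` are all hypothetical, and the sketch simplifies the rule to one comparison group measured against one fixed reference group.

```python
from dataclasses import dataclass

FOUR_FIFTHS = 0.8  # threshold of the four-fifths rule


@dataclass
class Position:
    name: str
    ref_applied: int   # reference group: applications
    ref_selected: int  # reference group: recommended
    grp_applied: int   # comparison group: applications
    grp_selected: int  # comparison group: recommended


def impact_ratio(ref_applied, ref_selected, grp_applied, grp_selected):
    """Comparison group's selection rate divided by the reference group's."""
    return (grp_selected / grp_applied) / (ref_selected / ref_applied)


# Hypothetical counts, invented for illustration only.
positions = [
    Position("analyst", 100, 50, 100, 30),
    Position("clerk", 100, 40, 100, 44),
    Position("driver", 100, 40, 100, 44),
]

# Question 1: test every position on its own.
per_position = {
    p.name: impact_ratio(p.ref_applied, p.ref_selected, p.grp_applied, p.grp_selected)
    for p in positions
}
flagged = [name for name, ratio in per_position.items() if ratio < FOUR_FIFTHS]

# Question 2: pool all positions first, then test once.
pooled_ratio = impact_ratio(
    sum(p.ref_applied for p in positions),
    sum(p.ref_selected for p in positions),
    sum(p.grp_applied for p in positions),
    sum(p.grp_selected for p in positions),
)

print("flagged positions:", flagged)
print("pooled ratio passes:", pooled_ratio >= FOUR_FIFTHS)
```

With these invented counts, the "analyst" position has an impact ratio of 0.6 and fails the test, while the pooled ratio sits near 0.91 and passes. Same data, same arithmetic, different unit of analysis.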

“Aggregating individual positions up to occupation groups is enough, on its own, to make per-position discrimination disappear from the report. The bias was never absent. It was averaged away.”

Figure: a row of short, even bars with one bar spiking far above a flat average line. The single tall bar is the position where the harm lives; the average smooths it out of view.

This is the most important point in the entire study, and it has nothing to do with one vendor. A system can pass at the portfolio level and fail, repeatedly, at the level where a real person is actually rejected.

I have made this same argument in every fairness session I run, usually to a room that wants the comfortable answer. A single aggregate fairness score is not evidence of fairness. It is often the opposite: a number engineered, sometimes unintentionally, to be reassuring. Context is what turns a metric from “something to look at” into a basis for a decision, and the unit of context here is the position, not the platform.

The algorithmic blackball

The second finding is the one that should worry job seekers, and it is genuinely new. Because the same vendor scores candidates for many different employers, and because an algorithm gives the same output for the same input every time, a rejection from one company predicts rejection from the next far better than chance would allow. The researchers call this systemic rejection. Among applicants who applied to ten positions screened by the same vendor, 4% were rejected from all ten, a rate too high to be coincidence if each employer were deciding independently.

The mechanism is mundane, and that is what makes it serious. When a candidate plays the assessment games, their scores are stored and reused for up to 330 days. Two employers using the same vendor are not giving an applicant two evaluations. They are giving the same evaluation twice. The team calls the result an algorithmic blackball, a concept that had been theorised in the literature but never before observed at this scale in live, deployed data. Their simulation put a number on the cost to applicants: to push the chance of being shut out everywhere below 0.1%, a candidate would need to apply to at least 25 positions, more than double the ten that would suffice if each decision were truly independent.
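The arithmetic behind that comparison can be sketched as follows. This is not the paper's simulation. The per-position rejection probability of 0.5 is an assumption, chosen because it makes ten applications the threshold under independence. The shared-score model, with its two applicant types and their probabilities, is a hypothetical stand-in for correlated decisions and is not meant to reproduce the paper's figure of 25.

```python
def min_applications(prob_all_rejected, target=0.001, limit=1000):
    """Smallest n for which the chance of being rejected n times is below target."""
    for n in range(1, limit + 1):
        if prob_all_rejected(n) < target:
            return n
    raise ValueError("target not reached within limit")


P_REJECT = 0.5  # assumed per-position rejection probability


def independent(n):
    # Each employer decides on its own: probabilities multiply.
    return P_REJECT ** n


# Hypothetical shared-score model: half of applicants carry a stored score
# that is rejected 80% of the time, the other half one rejected 20% of the time.
# The average rejection rate is still 0.5, but outcomes are now correlated.
SHARE_LOW = 0.5
P_LOW_SCORE = 0.8
P_HIGH_SCORE = 0.2


def shared_score(n):
    return SHARE_LOW * P_LOW_SCORE ** n + (1 - SHARE_LOW) * P_HIGH_SCORE ** n


n_independent = min_applications(independent)
n_shared = min_applications(shared_score)

print("independent decisions:", n_independent)
print("shared stored score:", n_shared)
```

Under independence the chance of ten straight rejections is 0.5 to the power of ten, just under 0.1%. In the shared-score model the average rejection rate is unchanged, yet the applicant needs many more applications to reach the same safety margin, because the same stored score keeps producing the same answer.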

“A human recruiter has a bad day, a blind spot, a different mood on Tuesday. That noise is, perversely, a kind of protection. A monoculture removes it. One model, one verdict, repeated across an entire labour market.”

One vendor, many employers: the monoculture problem

This is why the paper’s title says monoculture rather than bias. The deeper risk is concentration. The authors note that as of May 2023, more than 60% of the Fortune 100 and eight of the ten largest U.S. federal agencies relied on a single dominant vendor’s algorithms for hiring. When one model sits inside that many decisions, its quirks stop being a product flaw and become market infrastructure. A shortfall in one place is now a shortfall everywhere, simultaneously, and a single point of failure can disrupt hiring across thousands of employers at once.

Bias in a monoculture does not just affect more people. It affects them in a correlated way, which is a different and harder problem than the same number of independent errors.

The regulatory vise is closing

This study did not arrive in a vacuum. It arrived weeks before the rules change.

In the United States, New York City’s Local Law 144 was the first regulation aimed directly at automated hiring tools. The researchers found that the guidance around it appears to instruct auditors to pool data across positions and employers, which is precisely the aggregation method they show can mask disparity. A compliance regime can be satisfied and the underlying harm can remain fully intact. That gap should unsettle anyone who treats an audit checkmark as proof of fairness.

In Europe, the position is sharper. The EU AI Act classifies AI used in recruitment and hiring as high-risk by default, and the obligations for high-risk systems take effect on 2 August 2026. That is not a distant horizon. For any organisation operating in or hiring into the EU, the requirements for risk management, data governance, transparency, human oversight, and post-market monitoring are about to become legal duties rather than good intentions. A study showing position-level discrimination, surfaced by independent researchers, is a preview of exactly the evidence regulators and claimants will be looking for.

What this means if you use AI in hiring

Four practical conclusions follow directly from the research.

01. Measure adverse impact at the position level, not the portfolio level. If your vendor reports a single aggregate fairness figure, you do not yet know whether you are compliant. You know that someone averaged.
02. Do not accept a vendor’s self-assessment as your assurance. The vendor in this study was not acting in bad faith; it was answering the wrong question with its own tools. Independence is the only reason these disparities came to light at all.
03. Account for concentration. If you and your competitors all screen through the same model, you are not diversifying your judgement, you are syndicating one. Ask what that means for the candidates you never see and for the systemic risk you are quietly importing.
04. Treat the August 2026 deadline as a planning date, not a filing date. The work of evidencing fairness, documenting trade-offs, and standing them up for an auditor or a Board takes longer than the paperwork suggests.

The missing layer: independent assessment

The most quietly devastating line in the paper is the one about why the study was even possible. It happened because the vendor voluntarily shared its data under an agreement that protected the researchers’ independence. The authors are clear that independent research is what illuminates otherwise opaque hiring algorithms, and equally clear that findings like these could discourage the next vendor from ever opening the door.

That is the structural hole this study exposes, and it is the one validant.ai exists to fill. The lesson here is not that AI hiring is uniquely evil. It is that fairness cannot be certified by the same party that builds and sells the system, measured with the metric most likely to flatter it, at the level of aggregation least likely to reveal a problem. Fairness needs an independent read: position-level, evidenced, transparent about who the system advantages and who it is willing to let lose, and accountable to the people who carry the legal and reputational risk.

This is exactly what we built validant.ai to do. We run position-level fairness assessment against the four-fifths rule and the other lenses each domain demands, we keep the bias diagnosis and the evidence separable from any claim of a clean result, and we produce a read that an auditor, a journalist, or a regulator can actually interrogate. Not a verdict that ends the conversation. An evidence base that makes the conversation accountable.

“No system has ever been fair, and a single number will never make one so. What we can do is ask the right question, at the right level, and then prove our answer to someone who has no incentive to like it.”

Read the study, then look at your own stack

Read the full paper, “Algorithmic Monocultures in Hiring,” at algorithmichiring.github.io/paper.pdf, and Fortune’s coverage by Nick Lichtenberg at fortune.com.

Then ask the question the study forces: if someone analysed your hiring tools position by position, against the four-fifths rule, what would they find? If you are not certain of the answer, that uncertainty is the finding.

At validant.ai we build independent, position-level fairness assessment for AI systems, designed for the evidence standard the EU AI Act will require from 2 August 2026. If you want to know what your hiring stack actually does before a regulator, a journalist, or a researcher tells you, get in touch.


Daniel Glinz works on AI fairness, digital trust, and regulatory readiness, and is the creator of validant.ai.
