Safety you can't price is safety you can't trust.

Ollive AI Risk Labs studies how AI agents fail, what those failures cost, and what evidence it takes to deploy them anyway.

Capability has a thousand benchmarks. Consequence has none. Until now.

Every transformative technology got its safety institution from the same place.

In 1894, fire insurers founded Underwriters Laboratories because nobody could tell a safe electrical product from a fire waiting to happen. Auto insurers built the crash test and the IIHS. Boiler insurers ended the era of exploding steam engines by inspecting what they covered. Each time a technology outran society’s ability to judge its risk, the institution that closed the gap was built by the people with money on the outcome.

AI agents are that technology now. Ollive AI Risk Labs is that institution.

The benchmark era has failed the people who rely on it.

Today’s leaderboards were built to measure capability, not consequence. They saturate, they leak into training data, and they reward the test rather than the task. Researchers have driven industry-standard agent benchmarks to perfect scores without solving a single task. Frontier models demonstrably behave better under evaluation than in deployment. A regulator, a hospital, or a court gets almost no decision-useful information from a leaderboard position.

The field doesn’t need another ranking. It needs measurement infrastructure that survives contact with adversaries, with production, and with litigation.

We measure safety in dollars, because dollars are how society already adjudicates harm.

A hallucination is not an abstraction. It is a refund, a lawsuit, a regulatory proceeding, a discrimination claim. Every failure mode an agent can produce maps to a loss pathway: who relied on the output, what harm followed, what legal theory attaches, what it costs to defend and settle.

The Labs traces that full chain. Failure mode to harm. Harm to liability. Liability to dollars. Courts, regulators, and boards have used this unit of account for two centuries. AI safety research should be legible to it.

Skin in the game is the strongest peer review.

Anyone can publish a score. Ours gets underwritten. Findings from the Labs feed real insurance policies with real limits and real claims. If our measurement is wrong, we pay for the error. That feedback loop, loss data correcting risk models year after year, is the same discipline that made fire, auto, and aviation progressively safer.

It is also an epistemic standard no leaderboard, certification, or self-attestation is held to. A bet is an opinion that faces consequences.

Safety is not only what attackers do to agents. It is what agents do with autonomy.

Much of the field studies how agents get broken: injections, jailbreaks, poisoned tools, corrupted retrieval. Necessary, and not sufficient. An agent with tools, memory, and authority to act can produce harm with no adversary in the loop. It can commit its principal to obligations, discriminate at scale across thousands of decisions, disclose what it was trusted with, and pursue its objective straight through a boundary nobody thought to write down.

The Labs studies both surfaces: the exploited agent and the obedient agent that was simply wrong. Any definition of safe that covers less is incomplete.

Insurers hold the evidence academia has never been able to reach.

The hardest problem in AI safety research is not method. It is ground truth. Academic labs study failures in sandboxes; the failures that matter happen in production, inside companies, behind NDAs, and end up as confidential settlements. Insurance is the one institution that sees them systematically: as claims, incidents, and near misses, with severity attached.

The Labs exists to put that evidence base to work: production-grade agent systems, real deployment contexts, and a growing corpus of incident and loss data, studied with academic rigor and published for the field.

We open the method. We protect the test.

Our taxonomies, methodology, and findings are published for researchers, standards bodies, and regulators to use, replicate, and challenge. Our test suites stay private and rotating, so no model trains on the exam and no vendor optimizes to the answer key. Open science where openness compounds trust. Held-out rigor where openness would destroy the instrument.

Markets can produce governance evidence faster than legislation can.

Regulators worldwide are converging on the same requirements: independent testing, adversarial evaluation, demonstrable risk management. What they lack is the measurement infrastructure to make those requirements real. Insurance built that infrastructure for fire, autos, and aviation, and it can build it for AI: standards that update at the speed of the technology, enforced by premiums rather than penalties, generating the actuarial record that good regulation eventually needs.

We are building that record in public view.

This is an invitation.

Ollive AI Risk Labs hosts researchers from academia and industry to build AI RiskBench: a systematic study of loss pathways, failure modes, and risk families across agent architectures, modalities, models, and sectors. Its output is the Agent Trust Score, a measurement designed so developers can release with evidence and society can adopt without blind faith.

If you believe AI should be deployed boldly and measured honestly, come build the evidence base with us.


The world will run on agents.
Someone has to underwrite that.

