Work Transformers Labs · Research

Project Thunderpoint™

Applied research into the harness around AI: the guardrails, verification, context control and human checkpoints that make AI dependable enough for high-stakes operational work.

Applied research · Ongoing programme
The state of play
Reliability is engineered around the model, not prompted into it.

Most AI projects do not fail because the model is not clever enough. They fail because nothing around the model keeps it honest, bounded and checkable. As autonomy rises, that gap is becoming the main reason serious deployments stall.

40%+ of agentic AI projects will be cancelled by 2027 (Gartner · 2026)
60%+ of agentic orchestrations miss performance or cost targets by 2030 (Gartner · 2026)
2,000+ AI-related legal claims expected by the end of 2026 (Gartner · 2026)
The silent failure
The dangerous failure is the one that reports success.

An agent hits a wall it cannot get past, and reports the task done anyway. Systems do exactly what they are told, not what was meant, and small errors compound at machine speed. In a deal pack or a cost review, a confident wrong answer is worse than no answer at all.

Does what it is told

Not what was meant. The instruction is followed literally, even when the situation has changed.

Reports false success

The system claims a result it never achieved, and nothing checks the claim against what it did.

Compounds quietly

One small error feeds the next, at machine speed, with no alert until the output is already wrong.

The research question

Project Thunderpoint studies the harness around the model: what has to sit around an AI system so a serious firm can rely on its output. The aim is not a better prompt but a better control system.

Not a prompt problem
The harness, not the prompt.

The most reliable systems are not the ones with the cleverest instructions. They are the ones wrapped in a harness: bounds on what the system can do, an independent check on what it actually did, deterministic handling of the predictable parts, and a person at the decisions that matter.

Guardrails and bounds

Caps on iterations, scope and spend, so a stuck system fails safely rather than spinning.

Independent verification

A separate check reads what the system did, not what it claimed, and can halt it.

Context kept clean

Long, noisy context is compacted, so the instruction that matters is not lost.

Deterministic handlers

The parts that should never vary are coded, not generated, and behave the same every time.

Escalation gates

Anything beyond a threshold or below a confidence level routes to a named reviewer.

Human sign-off

Automation carries the load; the judgement and the sign-off stay with a person.
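
As a rough illustration of the first two components working together, the sketch below runs a task inside hard caps and accepts it only when a separate check passes on the artifact itself. It is a minimal sketch under stated assumptions, not Thunderpoint's implementation: `Bounds`, `StepResult`, `Outcome` and `run_bounded` are hypothetical names, and the cost units are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Bounds:
    max_iterations: int  # cap on attempts
    max_spend: float     # cap on cumulative cost (arbitrary units)


@dataclass
class StepResult:
    claimed_done: bool   # what the system says about itself
    artifact: str        # what the system actually produced
    cost: float          # cost of this step


@dataclass
class Outcome:
    status: str          # "verified" or "halted: ..."
    iterations: int      # steps actually run
    false_claims: int    # times "done" was claimed but the check failed


def run_bounded(
    step: Callable[[int], StepResult],
    verify: Callable[[str], bool],
    bounds: Bounds,
) -> Outcome:
    """Run `step` until the verifier accepts the artifact or a cap is hit."""
    spent = 0.0
    false_claims = 0
    for i in range(1, bounds.max_iterations + 1):
        result = step(i)
        spent += result.cost
        if spent > bounds.max_spend:
            return Outcome("halted: spend cap", i, false_claims)
        # Independent check: inspect the artifact, never the claim.
        if verify(result.artifact):
            return Outcome("verified", i, false_claims)
        if result.claimed_done:
            false_claims += 1
    return Outcome("halted: iteration cap", bounds.max_iterations, false_claims)


# Hypothetical stuck agent: always reports success, never produces anything.
def stuck_agent(i: int) -> StepResult:
    return StepResult(claimed_done=True, artifact="", cost=1.0)


def non_empty(artifact: str) -> bool:
    return bool(artifact.strip())


outcome = run_bounded(stuck_agent, non_empty, Bounds(max_iterations=3, max_spend=10.0))
print(outcome.status, outcome.false_claims)
```

The design choice worth noting is that `claimed_done` is only ever counted, never trusted. The loop ends on a verified artifact or on a cap, so a stuck system stops with a recorded reason instead of reporting success.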

How it fits together
Dependability comes from the harness, not the prompt.

Outputs are tested, guarded, evaluated and traceable, so the work can be trusted where it matters most.

[Diagram: three stages. What goes in: model output, test cases, real data, edge cases, rules. The harness: spec, test, guardrails, evaluate, detect drift, escalate, verify, log, certify. What you can trust: caught failures, confidence scores, clear escalation, full traceability, dependable output, trust at IC.]
Design principles
Six principles for AI you can depend on.
01
A verifier with no shared incentive

A separate check reads what the system actually did, not what it claimed, and can stop it.

02
Bounded autonomy

Caps on iterations, scope and spend, so a stuck system fails safely instead of spinning.

03
Deterministic handling of the predictable

The parts that must never vary are coded, not generated, so they behave the same every time.

04
Context kept clean

Long, noisy context is compacted, so the system does not lose the instruction that matters.

05
Escalation by design

Anything beyond a set threshold or below a set confidence level routes to a named reviewer automatically.

06
A person at the decision

Automation handles the volume; judgement and sign-off remain human.
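
Principles 03 and 05 can be read together as a small, coded routing rule: the decision about who looks at a result is deterministic, not generated. The sketch below is one possible shape, assuming a hypothetical `Finding` with an amount at stake and an evaluator-supplied confidence score; the names, thresholds and reviewer are illustrative only. A "proceed" result still ends at human sign-off in the framing above; escalation simply sends the item to a named reviewer first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    description: str
    amount: float       # value at stake
    confidence: float   # 0.0 to 1.0, supplied by a separate evaluator


@dataclass(frozen=True)
class GatePolicy:
    amount_threshold: float  # escalate anything strictly above this
    min_confidence: float    # escalate anything strictly below this
    reviewer: str            # a named person, not a shared inbox


def route(finding: Finding, policy: GatePolicy) -> str:
    """Return "proceed" or "escalate:<reviewer>" from fixed, coded rules."""
    if finding.amount > policy.amount_threshold:
        return f"escalate:{policy.reviewer}"
    if finding.confidence < policy.min_confidence:
        return f"escalate:{policy.reviewer}"
    return "proceed"


# Hypothetical policy and findings, for illustration only.
policy = GatePolicy(amount_threshold=10_000.0, min_confidence=0.8, reviewer="A. Reviewer")

large = Finding("supplier overcharge", amount=50_000.0, confidence=0.95)
unsure = Finding("duplicate invoice", amount=500.0, confidence=0.40)
routine = Finding("rounding difference", amount=12.0, confidence=0.99)

for f in (large, unsure, routine):
    print(f.description, "->", route(f, policy))
```

Because `route` is plain code with no model call, the same finding always takes the same path, which is what makes the gate auditable.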

What clients can learn

Firms working with us through Thunderpoint get an honest read on where their AI can and cannot be trusted, and what harness it would take to make it dependable in real operations. It is the same harness we build into every deployment.

Where this is heading
The winners will not have the cleverest models. They will have the most dependable systems.

As more work is handed to AI, dependability becomes the constraint on value, not raw capability. Most agentic projects that fail will fail on reliability and control, not intelligence. Engineering the harness is how that gap closes.

How we publish

What we share, and what we keep.

Project Thunderpoint is open research into reliability: the harness of guardrails, verification and human checkpoints around the model. We publish the questions we are wrestling with, the patterns that recur across real deployments, and the principles that hold up under pressure. We do not publish client data, or the parts of the method still being worked out. It is research, not a product: the questions are the interesting part, and the answers are earned in the work.
