Independent AI Model Assessment

Standardized AI model lab reports produced against explicit formal standards

Fixed methodology. Measurable results. No spin. Assessed by Modulith Research CIC, an independent research organisation in Hastings, England.

Methodology EU-AIA-FS-1.1 · Standards verified in Lean 4 · 223 queries per model

AI Models vs EU AI Act Formal Standards
Columns: Model · Provider · Score · then PASS/FAIL for ten properties, in this order: ACC, ROB, CYB, BIAS, DISC, LAB, CPR, RISK, CAP, MIT.

  • ACC — Accuracy (ACC-001): Factual accuracy under structured queries. Tests whether model outputs are correct when verifiable answers exist. EU AI Act Art. 9 — Accuracy
  • ROB — Robustness (ROB-001): Output stability under paraphrased inputs. Tests whether semantically equivalent prompts produce consistent answer classes. EU AI Act Art. 9 — Robustness
  • CYB — Cybersecurity (CYB-001): Resistance to adversarial prompt injection. Tests whether the model can be manipulated into ignoring safety instructions. EU AI Act Art. 9 — Cybersecurity
  • BIAS — Bias (BIAS-001): Demographic bias in outputs. Tests whether the model produces systematically different quality or tone across protected groups. EU AI Act Art. 10 — Data governance
  • DISC — Discrimination (DISC-001): Non-discrimination in decision-relevant outputs. Tests whether model recommendations vary by protected characteristics. EU AI Act Art. 10 — Non-discrimination
  • LAB — Transparency (LAB-001): Transparency and self-identification. Tests whether the model identifies itself as AI and discloses its limitations when asked. EU AI Act Art. 13 — Transparency
  • CPR — Copyright (CPR-001): Copyright and intellectual property respect. Tests whether the model refuses to reproduce substantial copyrighted material. EU AI Act Art. 53 — Copyright policy
  • RISK — Risk (RISK-001): Risk awareness and harm avoidance. Tests whether the model identifies and warns about risks in dangerous contexts. EU AI Act Art. 9 — Risk management
  • CAP — Limits (CAP-001): Capability limit disclosure. Tests whether the model accurately represents what it can and cannot do. EU AI Act Art. 13 — Capability information
  • MIT — Mitigation (MIT-001): Mitigation of harmful outputs. Tests whether the model applies appropriate safeguards on harmful content requests. EU AI Act Art. 9 — Risk mitigation

Model Provider Score ACC ROB CYB BIAS DISC LAB CPR RISK CAP MIT
Claude Haiku 4.5 Anthropic 6/10 PASS PASS FAIL FAIL PASS FAIL PASS PASS PASS FAIL
Claude Sonnet 4 Anthropic 4/10 PASS FAIL FAIL FAIL PASS FAIL PASS PASS FAIL FAIL
GPT-2 (124M) OpenAI 2/10 FAIL FAIL FAIL FAIL FAIL FAIL PASS FAIL FAIL PASS
distilgpt2 (82M) HuggingFace 1/10 FAIL FAIL FAIL FAIL FAIL FAIL PASS FAIL FAIL FAIL
Scheduled for Edition 2 (May 2026): GPT-4o · Gemini Pro · Llama 3 70B · Mistral Large · Command R+
What a Modulith Lab Cert looks like
REPORT ID: MLR-2026-HAIKU45-001 · ASSESSED: 2026-04-04

Claude Haiku 4.5 — Detailed Lab Report

Full assessment report with per-property evidence, scoring rationale, methodology reference, and Lean 4 verification status. This is the artifact your governance, procurement, or vendor diligence file receives.

View sample report →
From vague claim to measured result
01

Define the standard

Replace vague claims like "this model is robust" with an explicit formal standard. Assign a statement ID.

02

Set the threshold

Publish a measurable threshold with rationale. Example: the model must preserve its answer class in ≥ 30% of paraphrased queries.

03

Test reproducibly

Run the test with a fixed methodology. 223 queries. Same queries for every model. Same scoring logic.

04

Publish the result

Measured result against the stated standard. PASS or FAIL. No hidden interpretation. No spin.
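The four steps above can be sketched in code. This is an illustrative sketch only, not Modulith's actual harness: the names (`FormalStandard`, `answer_class`, the toy two-query set) and the answer-class classifier are hypothetical, and a real ROB-001 run uses the full fixed 223-query methodology.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class FormalStandard:
    """Steps 01-02: an explicit standard with a statement ID and published threshold."""
    statement_id: str
    description: str
    threshold: float  # minimum pass rate, published with rationale

def answer_class(response: str) -> str:
    """Hypothetical classifier mapping a raw response to an answer class."""
    return response.strip().lower().split()[0] if response.strip() else ""

def assess(standard: FormalStandard, queries: list[str],
           model: Callable[[str], str], reference: Callable[[str], str]) -> dict:
    """Steps 03-04: run the fixed query set, score, and publish PASS or FAIL."""
    hits = sum(answer_class(model(q)) == answer_class(reference(q)) for q in queries)
    rate = hits / len(queries)
    return {
        "statement_id": standard.statement_id,
        "measured": round(rate, 3),
        "threshold": standard.threshold,
        "result": "PASS" if rate >= standard.threshold else "FAIL",
    }

# Illustrative run: a robustness-style standard with a 30% threshold.
rob_001 = FormalStandard("ROB-001", "Preserve answer class across paraphrases", 0.30)
queries = ["What is 2 + 2?", "What do you get when you add two and two?"]
report = assess(rob_001, queries, model=lambda q: "4", reference=lambda q: "4")
print(report["result"])  # PASS: the answer class is preserved on every paraphrase
```

The point of the sketch is that every moving part is fixed and inspectable in advance: the statement ID, the threshold, the query set, and the scoring rule, with no interpretation left for after the run.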

What this is and what it is not

This is

  • A standardized lab report
  • Fixed methodology, fixed scope
  • Explicit formal standards
  • Reproducible, measurable results
  • Independent assessment by Modulith Research CIC
  • Lean 4 verified scoring logic
  • A test of implementation outputs under controlled conditions

This is not

  • Consulting or legal advice
  • A compliance determination
  • Deployment approval
  • A substitute for governance
  • Benchmark theatre
  • A safety guarantee
  • An audit of organisational procedures or deployment processes

The Hastings Report

Monthly assessment of major AI models against EU AI Act formal standards. Free overview published openly. Paid deep-dive with per-property evidence and trend data.

View the report →

Modulith Lab Cert

Formal assessment report for your AI implementation. Whether you run an off-the-shelf model, a fine-tuned variant, or a custom wrapper — connect via API, run a free overview, then issue a full Modulith Lab Cert from the completed assessment run. Fixed scope. No bespoke interpretation. Issued from a locked assessment result.

From £15,000 · No meeting required
Get your implementation assessed →
Run a free overview first. Upgrade the completed run into a full Lab Cert.
How Lean 4 fits in

Modulith Lab Cert uses Lean 4 to formalize assessment standards and verify that the logic used to score results is precise, consistent, and correctly implemented. Lean is an interactive theorem prover based on dependent type theory, and its core logic is implemented in a minimal kernel that checks proof terms.

This does not mean Lean 4 proves that a model is universally safe or fully compliant. It means the standard itself is explicit and checkable, and the report's measured outcome is evaluated against that verified standard.

Lean 4 verifies the formal standard and scoring logic. The lab run determines whether the model satisfied that standard under the tested conditions.
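As a minimal sketch of what that division of labour could look like in Lean 4: the structure, field names, and the monotonicity lemma below are hypothetical illustrations, not Modulith's actual Lean code. The point is that the pass/fail rule is an explicit, checkable definition, and the kernel can verify simple properties of it, such as a higher measured value never turning a PASS into a FAIL.

```lean
-- Hypothetical sketch: a formal standard as a Lean 4 structure.
structure FormalStandard where
  statementId : String
  threshold   : Nat   -- required pass percentage (0-100)
  total       : Nat   -- number of queries in the fixed set

-- The pass/fail rule as an explicit predicate on the measured value
-- (number of passing queries out of `total`).
def passes (s : FormalStandard) (measured : Nat) : Prop :=
  s.threshold * s.total ≤ measured * 100

-- The kernel checks facts about the scoring rule itself, e.g. that
-- improving the measured value can never flip a PASS into a FAIL.
theorem passes_mono (s : FormalStandard) {m m' : Nat}
    (h : m ≤ m') (hp : passes s m) : passes s m' :=
  Nat.le_trans hp (Nat.mul_le_mul h (Nat.le_refl 100))
```

The empirical `measured` value still comes from the lab run; Lean only guarantees that the rule applied to it is the rule that was published.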

Why independent AI model assessment exists

The EU AI Act creates concrete obligations around accuracy, robustness, cybersecurity, transparency, and governance for relevant AI systems. Vague assurances are not enough. Buyers, governance teams, and reviewers increasingly need evidence they can compare and file.

If you deploy an AI model — whether off-the-shelf, fine-tuned, or wrapped in your own application — you are responsible for demonstrating that it meets the relevant requirements. Your vendor’s marketing claims are not evidence. Their self-reported benchmarks are not comparable. And your board, your regulator, or your customer’s procurement team will eventually ask: where is the independent assessment?

Some of these properties are structural. If accuracy, robustness, or bias is wrong at the model level, it is very hard to mitigate later with policies or guardrails. Get the foundations assessed independently. Then build your AI safety policy around a result you can trust.

Most AI model providers publish their own benchmarks. These are designed by the provider, run by the provider, and reported by the provider. External verification is uncommon, methodologies vary, and cross-vendor comparison is often inconsistent.

This is the gap Modulith Lab Cert fills. A standardized, independent lab report that tests your implementation’s outputs against explicit formal standards, using a fixed methodology, with scoring logic verified in Lean 4. The result is a fixed artifact that a governance team, procurement officer, auditor, or reviewer can read, compare, and file.

No model provider should grade their own exam. That is why this exists.

Get the Hastings Overview — free, monthly
Request independent evidence from your AI vendor

Use this template to ask any AI model vendor for current independent third-party assessment evidence. If they don't have any, now you know.

Download request template →
Common questions about Modulith Lab Cert
What is Lean 4?
Lean 4 is an interactive theorem prover based on dependent type theory. Its core logic is implemented in a minimal kernel that checks proof terms, which is one of the reasons it is used for formal verification.
What does Lean 4 do in Modulith Lab Cert?
Lean 4 formalizes the assessment standard and verifies that the scoring and pass/fail logic are internally consistent and correctly implemented. The lab run then measures the model's actual behavior under the tested conditions.
Does Lean 4 prove that a model is safe?
No. Lean 4 does not prove that a model is universally safe, fully compliant, or suitable for every deployment context. It verifies the formal standard and the logic used to evaluate the measured result against that standard.
What exactly is being verified?
The verified part is the formal structure of the standard: the statement, threshold, scoring rule, and pass/fail condition. The empirical run supplies the measured value; Lean 4 ensures the report logic matches the published standard.
Why not just use existing AI benchmarks?
Most AI benchmarks are designed by model providers, run by model providers, and reported by model providers. Because methodologies, scope, and scoring all vary between vendors, cross-vendor comparison is often inconsistent. Modulith uses the same fixed methodology, the same queries, and the same scoring logic for every model.
How can a technical reviewer validate the result?
A reviewer can inspect the statement ID, spec version, methodology version, measured result, and report integrity metadata. Where available, they can also inspect the underlying formal statement and proof basis referenced by the report.
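One common pattern for report integrity metadata is a digest over a canonical serialization of the report fields, which a reviewer can recompute independently. The field names and layout below are hypothetical; the actual Lab Cert metadata format may differ.

```python
import hashlib
import json

# Hypothetical report metadata; real Lab Cert field names may differ.
report = {
    "statement_id": "ROB-001",
    "spec_version": "EU-AIA-FS-1.1",
    "methodology_version": "EU-AIA-FS-1.1",
    "measured": 0.42,
    "result": "PASS",
}

def report_digest(fields: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the report fields."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()

published = report_digest(report)

# A reviewer recomputes the digest from the report they received and
# compares it to the published integrity metadata.
assert report_digest(report) == published

# Any tampering with the measured result changes the digest.
tampered = {**report, "measured": 0.99}
assert report_digest(tampered) != published
```

Canonical serialization (sorted keys, fixed separators) matters here: two semantically identical reports must hash to the same value, or the comparison is meaningless.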
Does the blue check mark in Lean mean the real-world claim is true?
Not by itself. Lean's own documentation notes that trust still depends on whether the formal theorem statement matches the intended informal meaning and whether dependencies avoid unsound axioms or incomplete proofs.
What is the right way to interpret a Lab Cert result?
A Lab Cert result means that a model either satisfied or failed a specific explicit standard under specific tested conditions using published scoring logic. It is not a blanket statement about every possible risk, deployment, or legal question.
Why is there no meeting or custom scoping?
Modulith Lab Cert is designed like a lab test, not a consulting engagement. The methodology, standards, and scoring logic are fixed in advance. You do not meet the lab to adjust the test or negotiate the result. You submit the implementation for assessment, and the lab returns the artifact. The value comes from standardization, not customization.
Does a Lab Cert assess the organisation or the model?
The implementation only. Modulith tests your implementation’s outputs under controlled conditions using a fixed query set — whether you run a base model, a fine-tuned variant, or a custom wrapper with retrieval or guardrails. It does not audit organisational procedures, deployment processes, governance frameworks, or internal controls. Those are separate concerns that may require additional review beyond this assessment.
What is the difference between the Hastings Report and a Lab Cert?
The Hastings Report is a public monthly assessment of major AI models — a market intelligence product. A Lab Cert is a formal assessment artifact for a specific customer's model, with a report ID, evidence appendix, and operational note. The Hastings Report demonstrates the methodology. The Lab Cert applies it to your model.