Standardized AI model lab reports produced against explicit formal standards
Fixed methodology. Measurable results. No spin. Assessed by Modulith Research CIC, an independent research organisation in Hastings, England.
Methodology EU-AIA-FS-1.1 · Standards verified in Lean 4 · 223 queries per model
Each property column corresponds to a formal standard mapped to an EU AI Act article. Score is the number of properties passed, out of ten.

- ACC · Accuracy (ACC-001): factual accuracy under structured queries. Tests whether model outputs are correct when verifiable answers exist. EU AI Act Art. 9 (Accuracy).
- ROB · Robustness (ROB-001): output stability under paraphrased inputs. Tests whether semantically equivalent prompts produce consistent answer classes. EU AI Act Art. 9 (Robustness).
- CYB · Cybersecurity (CYB-001): resistance to adversarial prompt injection. Tests whether the model can be manipulated into ignoring safety instructions. EU AI Act Art. 9 (Cybersecurity).
- BIAS · Bias (BIAS-001): demographic bias in outputs. Tests whether the model produces systematically different quality or tone across protected groups. EU AI Act Art. 10 (Data governance).
- DISC · Discrimination (DISC-001): non-discrimination in decision-relevant outputs. Tests whether model recommendations vary by protected characteristics. EU AI Act Art. 10 (Non-discrimination).
- LAB · Transparency (LAB-001): transparency and self-identification. Tests whether the model identifies itself as AI and discloses its limitations when asked. EU AI Act Art. 13 (Transparency).
- CPR · Copyright (CPR-001): copyright and intellectual property respect. Tests whether the model refuses to reproduce substantial copyrighted material. EU AI Act Art. 53 (Copyright policy).
- RISK · Risk (RISK-001): risk awareness and harm avoidance. Tests whether the model identifies and warns about risks in dangerous contexts. EU AI Act Art. 9 (Risk management).
- CAP · Limits (CAP-001): capability limit disclosure. Tests whether the model accurately represents what it can and cannot do. EU AI Act Art. 13 (Capability information).
- MIT · Mitigation (MIT-001): mitigation of harmful outputs. Tests whether the model applies appropriate safeguards on harmful content requests. EU AI Act Art. 9 (Risk mitigation).

| Model | Provider | Score | ACC | ROB | CYB | BIAS | DISC | LAB | CPR | RISK | CAP | MIT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | 6/10 | PASS | PASS | FAIL | FAIL | PASS | FAIL | PASS | PASS | PASS | FAIL |
| Claude Sonnet 4 | Anthropic | 4/10 | PASS | FAIL | FAIL | FAIL | PASS | FAIL | PASS | PASS | FAIL | FAIL |
| GPT-2 (124M) | OpenAI | 2/10 | FAIL | FAIL | FAIL | FAIL | FAIL | FAIL | PASS | FAIL | FAIL | PASS |
| distilgpt2 (82M) | HuggingFace | 1/10 | FAIL | FAIL | FAIL | FAIL | FAIL | FAIL | PASS | FAIL | FAIL | FAIL |
Claude Haiku 4.5 — Detailed Lab Report
Full assessment report with per-property evidence, scoring rationale, methodology reference, and Lean 4 verification status. This is the artifact your governance, procurement, or vendor due diligence file receives.
Define the standard
Replace vague claims like "this model is robust" with an explicit formal standard. Assign a statement ID.
Set the threshold
Publish a measurable threshold with rationale. Example: the answer class is preserved across ≥ 30% of paraphrases.
Test reproducibly
Run the test with a fixed methodology. 223 queries. Same queries for every model. Same scoring logic.
Publish the result
Measured result against the stated standard. PASS or FAIL. No hidden interpretation. No spin.
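As a rough illustration only, the Lean 4 sketch below shows what a standard of this kind can look like: a hypothetical ROB-001-style record of measured counts, the ≥ 30% threshold from the example above, and a scoring function that returns PASS or FAIL. The names (RobustnessRun, rob001Threshold, score) are illustrative, not the actual EU-AIA-FS-1.1 definitions.

```lean
-- Minimal sketch of a formal standard and its scoring logic, assuming
-- hypothetical names; this is not the actual EU-AIA-FS-1.1 formalization.

/-- Measured data for one robustness run (ROB-001 style). -/
structure RobustnessRun where
  totalParaphrases     : Nat
  preservedAnswerClass : Nat

/-- Published threshold: the answer class must be preserved for at least 30% of paraphrases. -/
def rob001Threshold : Nat := 30

/-- Preservation rate in whole percent (Nat division keeps everything decidable). -/
def preservationRate (r : RobustnessRun) : Nat :=
  r.preservedAnswerClass * 100 / r.totalParaphrases

inductive Verdict where
  | pass
  | fail
  deriving DecidableEq, Repr

/-- Scoring logic: PASS exactly when the measured rate meets the stated threshold. -/
def score (r : RobustnessRun) : Verdict :=
  if preservationRate r ≥ rob001Threshold then .pass else .fail

-- Usage: 97 of 223 paraphrases preserved the answer class → 43% → PASS.
#eval score { totalParaphrases := 223, preservedAnswerClass := 97 }
```

In a sketch like this, Lean checks the definitions and any theorems stated about them; the measured counts themselves come from the lab run, not from Lean.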
This is
- A standardized lab report
- Fixed methodology, fixed scope
- Explicit formal standards
- Reproducible, measurable results
- Independent assessment by Modulith Research CIC
- Lean 4 verified scoring logic
- A test of implementation outputs under controlled conditions
This is not
- Consulting or legal advice
- A compliance determination
- Deployment approval
- A substitute for governance
- Benchmark theatre
- A safety guarantee
- An audit of organisational procedures or deployment processes
The Hastings Report
Monthly assessment of major AI models against EU AI Act formal standards. Free overview published openly. Paid deep-dive with per-property evidence and trend data.
View the report →

Modulith Lab Cert
Formal assessment report for your AI implementation. Whether you run an off-the-shelf model, a fine-tuned variant, or a custom wrapper — connect via API, run a free overview, then issue a full Modulith Lab Cert from the completed assessment run. Fixed scope. No bespoke interpretation. Issued from a locked assessment result.
Modulith Lab Cert uses Lean 4 to formalize assessment standards and verify that the logic used to score results is precise, consistent, and correctly implemented. Lean is an interactive theorem prover based on dependent type theory, and its core logic is implemented in a minimal kernel that checks proof terms.
This does not mean Lean 4 proves that a model is universally safe or fully compliant. It means the standard itself is explicit and checkable, and the report's measured outcome is evaluated against that verified standard.
Lean 4 verifies the formal standard and scoring logic. The lab run determines whether the model satisfied that standard under the tested conditions.
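As a minimal sketch of what that verification can look like, assuming hypothetical names (SpecPass, implPass), the theorem below states that an executable scoring check agrees with a written specification on every possible input. That is the kind of statement Lean's kernel checks: a claim about the scoring logic, not about any model.

```lean
-- Sketch of what "scoring logic verified in Lean 4" can mean. All names are
-- hypothetical; the point is the shape of the guarantee, not the real standard.

/-- Specification, transcribed from the published standard text:
    PASS means at least 30% of paraphrases preserved the answer class. -/
def SpecPass (preserved total : Nat) : Prop :=
  preserved * 100 ≥ 30 * total

/-- Executable scoring check run against the measured counts. -/
def implPass (preserved total : Nat) : Bool :=
  decide (preserved * 100 ≥ 30 * total)

/-- Verification theorem, checked by Lean's kernel: the executable check
    agrees with the specification on every input. It shows the scoring is
    faithful to the stated standard, not that any model is safe. -/
theorem implPass_correct (preserved total : Nat) :
    implPass preserved total = true ↔ SpecPass preserved total := by
  simp [implPass, SpecPass]
```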
The EU AI Act creates concrete obligations around accuracy, robustness, cybersecurity, transparency, and governance for relevant AI systems. Vague assurances are not enough. Buyers, governance teams, and reviewers increasingly need evidence they can compare and file.
If you deploy an AI model — whether off-the-shelf, fine-tuned, or wrapped in your own application — you are responsible for demonstrating that it meets the relevant requirements. Your vendor’s marketing claims are not evidence. Their self-reported benchmarks are not comparable. And your board, your regulator, or your customer’s procurement team will eventually ask: where is the independent assessment?
Some of these properties are structural. If accuracy, robustness, or bias is not addressed at the model level, it becomes very hard to mitigate later with policies or guardrails. Get the foundations assessed independently. Then build your AI safety policy around a result you can trust.
Most AI model providers publish their own benchmarks. These are designed by the provider, run by the provider, and reported by the provider. External verification is uncommon, methodologies vary, and cross-vendor comparison is often inconsistent.
This is the gap Modulith Lab Cert fills. A standardized, independent lab report that tests your implementation’s outputs against explicit formal standards, using a fixed methodology, with scoring logic verified in Lean 4. The result is a fixed artifact that a governance team, procurement officer, auditor, or reviewer can read, compare, and file.
No model provider should grade their own exam. That is why this exists.
Use this template to ask any AI model vendor for current independent third-party assessment evidence. If they don't have any, now you know.
Download request template →