Most AI discussions today start with the same refrain: “Good data is everything.” But what if AI could create better data, not just consume it? In their presentation at this year’s DATA festival in Munich, the team from the Global Legal Entity Identifier Foundation (GLEIF) turned the conversation around. Instead of blaming poor data for AI underperformance, they showed how large language models (LLMs), when guided by domain-specific ontologies, can help automate data quality governance – and scale it for an increasingly complex regulatory environment.
Setting the Scene: Who Is GLEIF and What Is an LEI?
GLEIF is a supra-national not-for-profit organization headquartered in Switzerland that ensures the operational stability of the Global LEI (Legal Entity Identifier) System. The LEI is a 20-character alphanumeric code based on the ISO 17442 standard that uniquely identifies legal entities across the world. Originally established in the wake of the 2008 financial crisis, the LEI system ensures transparency around two questions in financial transactions: “Who is who?” and “Who owns whom?”
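The code’s structure makes automated verification straightforward: the final two characters are check digits computed with the ISO/IEC 7064 MOD 97-10 scheme (the same family used for IBANs). A minimal Python sketch of that format check:

```python
def is_valid_lei(lei: str) -> bool:
    """Verify length, charset, and MOD 97-10 check digits of an LEI."""
    if len(lei) != 20 or not lei.isalnum():
        return False
    # Map letters to 10..35 (A=10, ..., Z=35); digits stay as they are.
    numeric = "".join(str(int(c, 36)) for c in lei.upper())
    return int(numeric) % 97 == 1
```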
Today, over 2.9 million LEIs exist – and each one includes structured reference data like name, address, local identifiers, and ownership structures.
GLEIF takes pride in the quality of this data, even though it’s openly available. Their rigorous data quality framework is based on daily validations, pre-submission APIs for issuing organizations, and a large suite of binary checks (pass/fail/not applicable) run on every record.
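To make the pass/fail/not-applicable pattern concrete, here is a minimal sketch of one such binary check; the rule and field names are invented for illustration:

```python
from enum import Enum

class CheckResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "not_applicable"

def check_hq_country_present(record: dict) -> CheckResult:
    """Hypothetical check: records carrying a headquarters address
    must also carry an ISO 3166 country code."""
    hq = record.get("headquarters_address")
    if hq is None:
        return CheckResult.NOT_APPLICABLE
    return CheckResult.PASS if hq.get("country") else CheckResult.FAIL
```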
The Old Way: Manual Rule Creation from Regulatory Texts
Traditionally, these data quality checks were derived from regulatory policy documents, which are often long and written in natural language. The process involved:
- Parsing policies into semi-structured rule documents (using “shall” and “shall not”).
- Converting those rules into a machine-readable custom rule language.
- Implementing them in PHP for daily execution.
This approach, while thorough, is time- and labor-intensive, and open to human interpretation errors. Adding new checks risks contradictions, and ensuring consistency across a growing dataset is difficult.
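As a rough illustration of the first step, a script might pull the normative “shall” / “shall not” sentences out of a policy text for a human to formalize; the pattern and sample sentences below are a toy sketch, not GLEIF’s actual tooling:

```python
import re

# Match whole sentences containing "shall" or "shall not".
OBLIGATION = re.compile(r"[^.]*\bshall(?:\s+not)?\b[^.]*\.", re.IGNORECASE)

def extract_obligations(policy_text: str) -> list[str]:
    return [m.group(0).strip() for m in OBLIGATION.finditer(policy_text)]

sample = ("The LEI shall be assigned to a single legal entity. "
          "An entity shall not hold more than one LEI.")
print(extract_obligations(sample))
# ['The LEI shall be assigned to a single legal entity.',
#  'An entity shall not hold more than one LEI.']
```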
The New Approach: Using LLMs to Automate Rule Extraction
To improve efficiency and scalability, the team developed a new pipeline using LLMs and ontologies to automate rule generation:
1. Preprocessing the Input
   - Documents are broken into context-aware chunks.
   - Key entities are tagged.
   - Tables and visuals are converted to text.
   - Irrelevant content is filtered out based on entity mentions.
2. Relationship Extraction with Ontology-Guided Prompts (a sketch follows the list)
   - LLMs use ontologies and a small number of examples to identify relevant relationships.
   - Chain-of-thought prompting makes the reasoning paths explicit.
3. Translating Relationships into Check Descriptions
   - Multiple translation paths are explored.
   - The best candidates are selected using logic checks or LLM-as-a-judge mechanisms.
   - Rules are formalized for downstream use.
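A minimal sketch of the extraction step (2.), assuming an OpenAI-style chat API; the ontology snippet, the few-shot example, and the model name are illustrative stand-ins, not GLEIF’s actual assets:

```python
from openai import OpenAI

client = OpenAI()

# Invented mini-ontology and few-shot example for illustration only.
ONTOLOGY = """
Entities: LegalEntity, Branch, Fund
Relations: hasBranch(LegalEntity, Branch), isManagedBy(Fund, LegalEntity)
"""

FEW_SHOT = """
Text: "A fund shall name the legal entity that manages it."
Reasoning: The sentence obliges a Fund to reference its manager,
which maps to the isManagedBy relation.
Relation: isManagedBy(Fund, LegalEntity)
"""

def extract_relations(chunk: str) -> str:
    """Ask the model for relations, constrained to the ontology,
    with chain-of-thought reasoning made explicit."""
    prompt = (
        "Use only entities and relations from this ontology:\n"
        f"{ONTOLOGY}\n"
        "Think step by step, then state the relation, as in this example:\n"
        f"{FEW_SHOT}\n"
        f'Text: "{chunk}"\nReasoning:'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```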
Why Ontologies Matter
Providing the LLM with a well-defined ontology acts as a constraint mechanism. It reduces hallucination and keeps the outputs within the domain-specific framework. It also enhances accuracy, as the model can “reason” about relationships between regulatory concepts more reliably.
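In code terms, the ontology acts like a whitelist: any extracted triple whose relation or argument types fall outside it gets rejected before it reaches the rule base. A toy sketch with invented names:

```python
# Allowed relations and their expected (subject, object) entity types.
ALLOWED = {
    "hasBranch": ("LegalEntity", "Branch"),
    "isManagedBy": ("Fund", "LegalEntity"),
}

def conforms(relation: str, subj_type: str, obj_type: str) -> bool:
    return ALLOWED.get(relation) == (subj_type, obj_type)

print(conforms("isManagedBy", "Fund", "LegalEntity"))    # True
print(conforms("isManagedBy", "Branch", "LegalEntity"))  # False: outside the ontology
```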
Quality Assurance: Contradiction and Overlap Checks
Before new checks are implemented, they are validated against a graph of existing conditional checks. If overlaps or contradictions are found, the system flags them for review, avoiding the nightmare of conflicting validations.
Two examples:
- ✅ “An end node of a branch relationship cannot be a branch entity” → passes integrity.
- ❌ “A start node cannot relate to itself” → overlaps with an existing check.
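A toy version of that review step: normalize each candidate check to a key and look it up among the existing checks before anything is implemented. The keys, IDs, and normalization below are invented for illustration:

```python
# Existing checks, indexed by a normalized (subject, constraint) key.
existing = {
    ("start_node", "no_self_relation"): "CHK-001",
}

def review(new_key: tuple[str, str]) -> str:
    if new_key in existing:
        return f"overlap with {existing[new_key]}: route to human review"
    return "passes integrity: safe to implement"

print(review(("branch_end_node", "not_branch_entity")))  # passes integrity
print(review(("start_node", "no_self_relation")))        # overlap with CHK-001
```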
Toward Self-Improving Systems
This isn’t just a one-off pipeline. The team is looking to fine-tune models using reinforcement learning, designing reward functions for each verifiable step in the pipeline. The goal: a system that learns where ambiguities lie – not just in the regulations, but in the organization’s own interpretations.
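As a hedged sketch of what such a reward function could look like, each verifiable step might contribute a binary signal that is weighted into a scalar reward; the steps and weights here are hypothetical:

```python
def pipeline_reward(parsed_ok: bool, relation_in_ontology: bool,
                    check_passes_integrity: bool) -> float:
    """Combine per-step verification signals into one training reward."""
    weights = (0.2, 0.4, 0.4)  # tunable; chosen to sum to 1.0 here
    signals = (parsed_ok, relation_in_ontology, check_passes_integrity)
    return sum(w * float(s) for w, s in zip(weights, signals))

print(pipeline_reward(True, True, False))  # 0.6
```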
Interestingly, many observed model “failures” weren’t due to hallucination. Instead, they exposed unclear wording in the original documents or internal inconsistencies – leading to better human processes in parallel.
Real-World Impact and Governance Model
GLEIF itself cannot alter data. That is the issuing organizations’ responsibility. But by offering API-based validation tools and a consistent rule base, they give partners the means to validate before upload. With over 200 checks in use (and growing), automating this governance process has proven to be not just efficient, but essential.
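From an issuer’s perspective, pre-submission validation might look like the sketch below; the endpoint URL and payload shape are hypothetical stand-ins, not GLEIF’s documented API:

```python
import requests

def validate_before_upload(record: dict) -> dict:
    """Send a candidate record to a validation endpoint and return
    the per-check pass/fail/not-applicable results."""
    resp = requests.post(
        "https://validation.example.org/v1/checks",  # hypothetical endpoint
        json=record,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```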
Key Takeaways
- Start with ontologies, not just prompts. Domain structure guides better extractions.
- AI doesn’t just need clean data – it can help create it.
- Decomposing tasks enables fine-tuning, even in highly specific domains.
- Contradiction detection matters. Rule overlap in complex systems must be managed.
- Failures are feedback. When LLMs struggle, it is often your own process that is unclear.
Conclusion
GLEIF’s AI journey is a masterclass in domain-driven automation. With a growing regulatory burden and limited manual capacity, their system demonstrates how combining structured domain knowledge with cutting-edge LLM techniques can enable AI not just to scale, but to govern.
The speakers in this session applied through the official Call for Presentations for DATA festival Munich. The application period for the upcoming DATA festival in Munich on June 16-17, 2026, and for the DATA festival online on October 21st is now open.
Both events are a perfect opportunity for data and AI enthusiasts to share their experiences and discuss them with peers.