From Hype to Production: How Merck, Siemens and Google Engineer Reliable AI Systems

Reading time: 10 minutes
This article synthesizes insights from three sessions at the DATA festival Online 2025: "Generative AI @ Scale for Enterprises" by Dr. Harsha Gurulingappa (Merck), "From Concept to Factory Floor: Designing Robust AI Architectures for Industrial Applications" by Dr. Nikita Golovko (Siemens), and "Reliability-first Generative AI: Turning Models into Business-Ready Systems" by Nitesh Singhal (Google).

Here is a story that should give every AI leader pause.

A team builds a computer vision model to detect surface defects on metal strips. In the lab, it performs brilliantly – 98% accuracy. They deploy it to the factory floor. Within a week, accuracy drops to 70%. Operators lose trust. They roll back to manual inspection.

What went wrong? Not the model. The architecture.

Changing lighting conditions, dust accumulating on sensors, equipment degradation, operators using tools in unexpected ways – none of this existed in the lab. The model was fine. Everything around it was not.

This story, shared by Dr. Nikita Golovko from Siemens at the DATA festival Online 2025, captures a truth that three very different speakers – from Siemens, Merck, and Google – independently confirmed across their sessions: reliable AI is just 20% about modeling. But what about the other 80%?

Three companies. Three industries. One conclusion.

Merck: GenAI for 31,000 Employees – What It Actually Takes

Dr. Harsha Gurulingappa leads GenAI infrastructure at Merck, a company operating across healthcare, life science, and electronics. Merck has deployed GenAI to over 31,000 active users. That is not a pilot. That is enterprise scale.

The foundation rests on three pillars: people, standardized operating models, and technology. None of these pillars is optional.

Three tiers of AI access

Merck structures its GenAI offering into three categories:

  • Everyday AI – An enterprise assistant called “myGPT” accessible to every employee. The goal: reduce the entry barrier to zero. 31,000 active users have created 17,000 custom assistants for specific tasks.
  • AI for data natives – A specialized platform (Palantir Foundry) for teams working with highly curated data from Merck’s internal data lake. Purpose-built for complex analytical use cases.
  • Developer toolkit – A modular system where developers pick and choose from inferencing services, vector databases, MLOps, coding assistants, and agent frameworks. High flexibility for custom solutions.

This approach serves each persona appropriately, enabling everything from low-barrier business user adoption to scalable, engineer-built AI agents.

Governance is not optional at 31,000 users

At this scale, governance becomes existential. Merck operates across five governance dimensions:

  • Documentation – Every use case has documented purpose, owners, linked data, associated risks, KPIs, and timelines.
  • Process and standards – Access management, change management, and release management follow defined protocols aligned with enterprise software standards.
  • Security – Identity management, authentication protocols, VPC peering for third-party cloud systems, and machine-to-machine permission handling. Non-negotiable for a highly regulated organization.
  • Transparency – Two-way: developers make their work discoverable for reuse; the platform makes consumption costs and artifact metadata visible to teams.
  • Enablement – Training materials, guides, and multi-format training sessions. A complex tech stack only creates value if people actually understand how to use it.

The observability imperative

One point Gurulingappa stressed repeatedly: observability is not a nice-to-have. It is a critical component.

Observability means tracing every step of an LLM application – what data it accessed, what queries it ran, what the model returned, and how guardrails performed. Without this, you cannot evaluate, monitor, or improve. And you cannot prove compliance.

For hallucination management specifically, Merck recommends guardrails at every single transaction in a multi-step agent flow. Not just at the final output. Because one hallucination in step three of a seven-step process corrupts everything downstream.
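
Neither session shared implementation code, but the pattern is easy to sketch. Here is a minimal Python illustration of tracing and guarding every transaction in a multi-step flow; `run_step` and `passes_guardrail` are hypothetical stand-ins for a real model call and a real guardrail check, not Merck's actual stack:

```python
import json
import time
import uuid

def run_step(step_name: str, payload: dict) -> dict:
    """Hypothetical stand-in for one LLM or tool call in an agent flow."""
    return {"step": step_name, "output": f"result of {step_name}"}

def passes_guardrail(result: dict) -> bool:
    """Hypothetical stand-in for a guardrail check (groundedness, PII, policy)."""
    return result.get("output") is not None

def run_agent_flow(steps: list[str], payload: dict) -> list[dict]:
    """Run a multi-step flow, tracing and guarding every single transaction."""
    trace_id = str(uuid.uuid4())
    trace = []
    for step in steps:
        started = time.time()
        result = run_step(step, payload)
        record = {
            "trace_id": trace_id,
            "step": step,
            "inputs": payload,                 # what data the step accessed
            "outputs": result,                 # what the model returned
            "latency_s": round(time.time() - started, 3),
            "guardrail_passed": passes_guardrail(result),
        }
        trace.append(record)
        # Guardrail at every transaction, not just the final output: one bad
        # intermediate result stops the flow before it corrupts downstream steps.
        if not record["guardrail_passed"]:
            raise RuntimeError(f"Guardrail failed at step '{step}' (trace {trace_id})")
        payload = result                       # feed forward to the next step
    return trace

for record in run_agent_flow(["retrieve", "summarize", "draft_answer"], {"query": "..."}):
    print(json.dumps(record, indent=2))
```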

Siemens: The 80/20 Rule of Industrial AI

Dr. Nikita Golovko’s opening story about the failed defect detection model was not an anecdote about bad luck. It was a deliberate illustration of a systematic problem: most AI failures in industrial settings are AI system architecture failures, not model failures.

His framework for reliable industrial AI covers five layers.

Layer 1: Smart data strategies at the edge

Before you build any model, you need reliable data. Golovko’s approach starts at the edge device, directly at the production line:

  • Pre-processing – Filter out noisy data, bad lighting conditions, and corrupted inputs before they ever reach a model.
  • Quality gateways – Detect data drift, environmental drift, sensor faults, and camera issues in real time.
  • Data contracts – Versioned schemas that can be validated, providing traceability and reproducibility.

The key insight: data quality is not a cloud problem. It is an edge problem. Fix it at the source.
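
Golovko described these mechanisms conceptually rather than in code, but a data contract can be as simple as a versioned schema validated at the edge before a sample ever reaches the model. A minimal sketch, with hypothetical field names and thresholds:

```python
from dataclasses import dataclass

CONTRACT_VERSION = "1.2.0"  # versioned schema: bump on any breaking change

@dataclass
class InspectionSample:
    """Hypothetical data contract for one inspection image at the edge."""
    camera_id: str
    exposure_ms: float
    mean_brightness: float  # expected range 0..255
    contract_version: str = CONTRACT_VERSION

def quality_gateway(sample: InspectionSample) -> list[str]:
    """Reject samples at the source that would silently degrade the model."""
    errors = []
    if sample.contract_version != CONTRACT_VERSION:
        errors.append(f"schema mismatch: {sample.contract_version} != {CONTRACT_VERSION}")
    if not 0.0 <= sample.mean_brightness <= 255.0:
        errors.append("corrupted input: brightness outside sensor range")
    elif sample.mean_brightness < 30.0:
        errors.append("bad lighting: image too dark for reliable inference")
    return errors

sample = InspectionSample(camera_id="cam-07", exposure_ms=4.0, mean_brightness=18.5)
problems = quality_gateway(sample)
if problems:
    print("rejected at the edge:", problems)  # never reaches the model
```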

Layer 2: Software engineering best practices for MLOps

The core principle is straightforward: keep every component modular and independently replaceable. When the model is decoupled from everything around it, you can swap it out, update a data source, or change infrastructure without risking the entire system.

Golovko implements this through what he calls a hexagonal architecture. Adapters and API gateways sit between the model and its environment, each with a clearly defined role:

  • Input adapters for camera feeds
  • Output adapters for PLC commands and data storage
  • A model replacement interface for safe deployments
  • Metrics collection and telemetry interfaces
  • Operator feedback interfaces

The result: no single change triggers a cascade of failures.
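
In code, the hexagonal pattern usually boils down to core logic that depends only on abstract ports, with adapters implementing them. A minimal sketch using Python protocols; all class and method names here are illustrative, not Siemens APIs:

```python
from typing import Protocol

class ImageSource(Protocol):        # input port, e.g. a camera feed adapter
    def next_frame(self) -> bytes: ...

class CommandSink(Protocol):        # output port, e.g. a PLC command adapter
    def send(self, command: str) -> None: ...

class DefectModel(Protocol):        # the model replacement interface
    def predict(self, frame: bytes) -> str: ...

class InspectionService:
    """Core logic depends only on the ports, never on concrete devices or models."""
    def __init__(self, source: ImageSource, model: DefectModel, sink: CommandSink):
        self.source, self.model, self.sink = source, model, sink

    def run_once(self) -> None:
        frame = self.source.next_frame()
        self.sink.send(self.model.predict(frame))

# Stub adapters to show the wiring; swapping the camera, model, or PLC
# means swapping one adapter, so no single change cascades through the system.
class StubCamera:
    def next_frame(self) -> bytes:
        return b"\x00" * 16

class StubModel:
    def predict(self, frame: bytes) -> str:
        return "NO_DEFECT"

class StubPLC:
    def send(self, command: str) -> None:
        print("PLC <-", command)

InspectionService(StubCamera(), StubModel(), StubPLC()).run_once()
```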

Layer 3: Edge-cloud split with a purpose

Inference happens on the shop floor – 50 milliseconds from image capture to PLC (Programmable Logic Controller) command. The operational use case demands this low latency; no cloud round-trip can achieve it. Training and retraining, by contrast, happen in the cloud, where compute resources are abundant.

The connection between edge and cloud is secured with signed artifacts, encrypted channels, and routing through firewalls. Security in industrial environments is not an afterthought – it is a constraint that shapes the entire architecture.
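
The session did not detail the signing mechanism, but the principle of verifying a signed artifact before the edge device loads it can be sketched in a few lines. HMAC is used here for brevity; production systems typically use asymmetric signatures:

```python
import hashlib
import hmac

SIGNING_KEY = b"key-provisioned-to-the-edge-device"  # illustrative only

def sign_artifact(artifact: bytes) -> str:
    """Cloud side: sign the model artifact before shipping it to the edge."""
    return hmac.new(SIGNING_KEY, artifact, hashlib.sha256).hexdigest()

def verify_artifact(artifact: bytes, signature: str) -> bool:
    """Edge side: refuse to load any model whose signature does not match."""
    expected = hmac.new(SIGNING_KEY, artifact, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

model_bytes = b"...serialized model weights..."
signature = sign_artifact(model_bytes)
print(verify_artifact(model_bytes, signature))         # True: untampered, loads
print(verify_artifact(model_bytes + b"x", signature))  # False: tampered, rejected
```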

Layer 4: The closed feedback loop

Models in isolation drift. Environmental conditions change. Hypotheses that seemed correct during development prove incomplete. Golovko closes the loop through:

  • Drift monitoring integrated into data pre-processing
  • Continuous model performance monitoring against reference results
  • KPI-based triggers for cloud retraining
  • Operator feedback interfaces for relabeling and re-annotation

Without this loop, model performance degrades undetected.
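
As a rough illustration of the KPI-based trigger, a rolling performance check against reference results might look like the following sketch; the window size and threshold are invented for the example:

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy against reference results; trigger retraining on drift."""
    def __init__(self, window: int = 200, kpi_threshold: float = 0.95):
        self.results = deque(maxlen=window)
        self.kpi_threshold = kpi_threshold

    def record(self, prediction: str, reference: str) -> None:
        self.results.append(prediction == reference)

    def rolling_accuracy(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_retrain(self) -> bool:
        # KPI-based trigger: kick off cloud retraining once accuracy drops
        # below the threshold, instead of letting performance degrade undetected.
        full_window = len(self.results) == self.results.maxlen
        return full_window and self.rolling_accuracy() < self.kpi_threshold

monitor = DriftMonitor(window=5, kpi_threshold=0.8)
for prediction, reference in [("ok", "ok"), ("defect", "ok"), ("ok", "defect"),
                              ("defect", "defect"), ("ok", "defect")]:
    monitor.record(prediction, reference)
print(monitor.rolling_accuracy(), monitor.should_retrain())  # 0.4 True
```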

Layer 5: Safe deployment through a smart model lifecycle

New models never go straight to production. Instead:

  1. Shadow deployment – The new model runs in parallel with the existing one, processing the same data but making no decisions. Its performance is monitored.
  2. Canary deployment – The new model starts making decisions, but only for a percentage of data.
  3. Production – After both stages validate performance, the model replaces the existing one.

Every stage includes rollback mechanisms. And every model artifact is stored in a registry with full metadata and linked training datasets.
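
Translated into code, the staged lifecycle is essentially a routing decision per request. A simplified sketch with placeholder models and percentages (rollback logic omitted):

```python
import random

def log_comparison(decision: str, shadow_result: str) -> None:
    print("models agree" if decision == shadow_result else "models disagree")

def shadow_route(request, current_model, new_model) -> str:
    """Shadow stage: the new model sees the same data but makes no decisions."""
    decision = current_model(request)
    shadow_result = new_model(request)  # logged and compared, never acted on
    log_comparison(decision, shadow_result)
    return decision

def canary_route(request, current_model, new_model, canary_share: float = 0.05) -> str:
    """Canary stage: the new model decides, but only for a share of requests."""
    model = new_model if random.random() < canary_share else current_model
    return model(request)

# Trivial stand-in models for demonstration:
current = lambda request: "NO_DEFECT"
candidate = lambda request: "NO_DEFECT" if request % 2 else "DEFECT"

print(shadow_route(4, current, candidate))  # candidate runs, current decides
print(canary_route(4, current, candidate))  # candidate decides ~5% of the time
```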

Golovko’s summary line stays with you: “AI success in industry is 80% architecture and 20% proper modeling. Let us start building for the factory floor, not for the laboratory.”

Google: Four Pillars of Trustworthy AI

Nitesh Singhal from Google approached reliability from a different angle – not industrial automation, but the fundamental challenge of hallucination in GenAI systems.

His framing was direct: “AI can be confident but wrong at the same time. It might say Einstein won an Oscar. These are not bugs. They are byproducts of how models work. And they are real business risks.”

He cited two concrete examples: a health tech company that issued a public apology after AI-generated false medical advice, and a law firm that cited cases in court that did not exist.

The four pillars

Singhal’s framework for trustworthy AI rests on four pillars:

1. Data quality – When training data no longer reflects the environment the model operates in, accuracy erodes, often without any visible signal. Many consumers experienced this firsthand when they first ran into the knowledge cut-off dates of large language models. Better data means better AI – and this requires active, ongoing data management.

2. Model selection – Not every model fits every task. Using a GPT-5 scale model for simple classification tasks is wasteful. Using a lightweight model for complex reasoning is dangerous. Balance performance, cost, and explainability.

3. Validation and guardrails – Testing responses against ground truth. Monitoring for anomalous outputs. Cross-checking with trusted sources. Guardrails are an important safety component in AI systems.

4. Human oversight – Humans in the loop for high-stakes outputs. Feedback loops that help models improve. Clear accountability through governance structures.

Practical strategies that work

Singhal moved beyond frameworks into specific techniques:

  • RAG (Retrieval Augmented Generation) – Instead of relying on the model’s training data, retrieve facts from reliable sources before generating. This reduces hallucination by grounding outputs in verified information (sketched in code after this list).
  • Fine-tuning with domain-specific data – Specialize generic foundational models for specific industries. Finance, healthcare, and legal each require models trained on domain-relevant data to produce accurate results.
  • Automated verification at inference time – Fact-checking systems that flag hallucinations, check logic, verify tone, and assess accuracy before outputs reach users.
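
None of these techniques require exotic tooling. A bare-bones RAG loop, with a hypothetical `llm()` call and a toy keyword retriever standing in for a real vector database, could look like this:

```python
# Minimal RAG sketch: retrieve grounding facts first, then generate from them.
KNOWLEDGE_BASE = {
    "einstein": "Albert Einstein won the Nobel Prize in Physics in 1921.",
    "oscar": "The Academy Awards honor achievements in film, not science.",
}

def retrieve(query: str) -> list[str]:
    """Toy keyword retriever; a real system would query a vector database."""
    return [fact for key, fact in KNOWLEDGE_BASE.items() if key in query.lower()]

def llm(prompt: str) -> str:
    """Hypothetical stand-in for an actual model call."""
    return f"[answer generated from a grounded prompt of {len(prompt)} chars]"

def answer(query: str) -> str:
    facts = retrieve(query)
    if not facts:
        # Refusing is safer than letting the model guess from training data alone.
        return "I don't have verified sources for that question."
    prompt = ("Answer strictly using these verified facts:\n"
              + "\n".join(f"- {fact}" for fact in facts)
              + f"\n\nQuestion: {query}")
    return llm(prompt)

print(answer("Did Einstein win an Oscar?"))
```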

The proof: a fintech case study

Singhal shared a concrete example. A fintech company built an AI assistant for financial advisors. The pilot version hallucinated so frequently that advisors stopped using it. The project was on the verge of being scrapped.

The turnaround involved four steps:

  1. Building a curated, domain-specific knowledge database
  2. Adding RAG to validate facts against that database
  3. Fine-tuning the model specifically for financial content
  4. Introducing human approval for high-stakes outputs

The results: 95% reduction in factual errors and 63% faster workflows. Financial advisors went from rejecting the tool to calling it their trusted co-pilot.

The Common Thread: Data Quality and Architecture First

Three speakers. Pharma, manufacturing, tech. Enterprise assistant, factory floor AI, financial advisor tool. Completely different use cases.

And yet they converge on the same fundamentals:

  • Data quality is the foundation. Every speaker placed data at the base of their reliability stack. Merck embeds AI into its broader data ecosystem. Siemens starts with smart data strategies at the edge. Google names data quality as pillar number one. Without trusted data, no architecture, no guardrails, and no amount of model sophistication will produce reliable outputs.
  • Product and system design determines success. The model is a component, not the system. How you deploy it, monitor it, secure it, and update it – that is what separates lab results from production value.
  • Observability is non-negotiable. You cannot improve what you cannot trace. Merck traces every LLM transaction. Siemens monitors drift in real time. Google uses automated verification at inference time. Blindly trusting a model in production is a recipe for the kind of failures that make headlines.
  • Humans remain in the loop. But the interaction mode differs from use case to use case. Merck has governance standards for every use case. Siemens builds operator feedback directly into the architecture. Google’s Nitesh Singhal mandates human oversight for high-stakes decisions.

What This Means for Your AI Strategy

If your organization is still treating AI as a model selection exercise, these three case studies suggest a recalibration.

The questions that matter are not “Which LLM should we use?” or “How do we fine-tune for our domain?” Those are important, but they account for roughly 20% of the work.

The questions that determine success are:

  • How reliable is the data feeding our AI systems?
  • What happens when a model starts drifting in production?
  • Can we trace every step of an AI decision when something goes wrong?
  • Who is accountable when an AI output is wrong?
  • How do we deploy model updates without risking production stability?

Merck, Siemens, and Google have answered these questions. The frameworks they shared at DATA festival Online are not theoretical – they are running in production, at scale, today.

The gap between AI that impresses in a demo and AI that delivers value in production is not about better models. It is about better engineering, one step at a time. And the first step is data.

DATA festival Munich is back!
Event | June 16-17, 2026 | Munich, Germany
Join fellow data leaders at the DATA festival to solve strategic challenges with data and AI. This is your community for exchanging proven methods and building a strong network.
Keep an eye out for updates!

Author(s)

Analyst Data & Analytics

Florian is an Analyst for Data & Analytics with a focus on Data Management. His primary interests include topics such as Data Catalogs, Data Intelligence, Data Products, and Data Integration.

He supports companies in selecting suitable software solutions, analyzes market developments, addresses the needs of user organizations, and evaluates innovations from software vendors.

As a co-author of BARC Scores, Research Notes, and Surveys, he regularly shares his insights and expertise. He frequently moderates events on data management topics. He is particularly fascinated by the rapid pace of technological advancement and the central role of data management in enabling the success of forward-looking technologies such as artificial intelligence.
