Operationalizing Gen AI: From Pilot to Production

Many financial institutions have run at least one Gen AI pilot (e.g. a chatbot, a document summarizer, or an internal assistant). The next step is turning that into a repeatable way to ship and run Gen AI applications with governance, evaluation, and clear ownership. This post outlines a practical path without overclaiming.

Start with one use case and one success metric

Pilots often focus on "can we build it?" Production demands "can we run it safely and measure it?" Pick one workflow (e.g. policy Q&A, underwriting summary, or analyst research aid) and define: (1) who uses it, (2) what "good" looks like (accuracy, latency, user satisfaction), and (3) what must be logged or reviewed for compliance. That shapes your architecture (RAG, agents, fine-tuning) and your guardrails.
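One way to make those three definitions concrete is to write them down as a small, versioned spec that the team reviews alongside the code. The sketch below is illustrative only: the class name `UseCaseSpec`, its fields, and the threshold values are assumptions made for this example, not part of any platform API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCaseSpec:
    """Hypothetical record of who uses a workflow, what 'good' means, and what is logged."""
    name: str
    users: tuple[str, ...]            # (1) who uses it
    min_accuracy: float               # (2) what "good" looks like
    max_p95_latency_s: float
    min_satisfaction: float           # e.g. mean rating on a 1-5 survey
    logged_fields: tuple[str, ...]    # (3) what must be logged or reviewed

    def meets_targets(self, accuracy: float, p95_latency_s: float, satisfaction: float) -> bool:
        """Return True only if every measured value satisfies its threshold."""
        return (
            accuracy >= self.min_accuracy
            and p95_latency_s <= self.max_p95_latency_s
            and satisfaction >= self.min_satisfaction
        )


# Example values are placeholders, not recommendations.
policy_qa = UseCaseSpec(
    name="policy-qa",
    users=("branch_staff", "compliance_analysts"),
    min_accuracy=0.90,
    max_p95_latency_s=3.0,
    min_satisfaction=4.0,
    logged_fields=("user_id", "question", "retrieved_docs", "answer"),
)
```

Calling `policy_qa.meets_targets(accuracy=0.93, p95_latency_s=2.1, satisfaction=4.2)` then gives a single pass/fail answer that can gate a release or feed a dashboard.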

Guardrails and safety

Input: Validate and sanitize user input; block or redact PII if the model should not see it. Mosaic AI Gateway provides PII detection (credit cards, SSNs, emails, phone numbers) out of the box.
Output: Use structured outputs (e.g. OpenAI's Structured Outputs with a strict JSON Schema, or Claude's tool use) so responses fit your downstream systems and can be validated before anything acts on them. AI Gateway guardrails add safety filtering to block harmful content.
Tool use: If the model calls tools (APIs, DBs), enforce permissions and rate limits in your own code; never let the model decide who can do what. Log every call and its result for audit. AI Gateway can log request and response payloads to Delta tables governed by Unity Catalog.
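The three guardrail layers above can be sketched in application code as follows. This is a minimal illustration under stated assumptions, not a replacement for a managed gateway: the regexes are deliberately simple and will miss real-world PII, and the names `redact_pii`, `parse_model_output`, `call_tool`, `ROLE_PERMISSIONS`, and the response fields are all hypothetical.

```python
import json
import re
import time
from collections import defaultdict, deque

# Illustrative regexes only; they will miss many real-world PII formats.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each matched PII span with a bracketed label before the model sees it."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical response contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def parse_model_output(raw: str) -> dict:
    """Parse the model's JSON and reject anything that breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not a {expected.__name__}")
    return data


# Hypothetical role-to-tool mapping, owned by application code, not by the model.
ROLE_PERMISSIONS = {
    "analyst": {"lookup_policy"},
    "underwriter": {"lookup_policy", "fetch_credit_file"},
}
MAX_CALLS_PER_MINUTE = 5
_recent_calls = defaultdict(deque)  # user_id -> timestamps of accepted calls
AUDIT_LOG = []  # in-memory stand-in for a durable audit sink


def call_tool(user_id, role, tool_name, tool_fn, args, now=None):
    """Run a model-requested tool only if the caller's role and rate limit allow it."""
    now = time.time() if now is None else now
    entry = {"ts": now, "user": user_id, "role": role, "tool": tool_name, "args": args}
    if tool_name not in ROLE_PERMISSIONS.get(role, set()):
        AUDIT_LOG.append({**entry, "outcome": "denied"})
        raise PermissionError(f"role {role!r} may not call {tool_name!r}")
    window = _recent_calls[user_id]
    while window and now - window[0] >= 60:  # drop calls older than one minute
        window.popleft()
    if len(window) >= MAX_CALLS_PER_MINUTE:
        AUDIT_LOG.append({**entry, "outcome": "rate_limited"})
        raise RuntimeError(f"rate limit exceeded for user {user_id!r}")
    window.append(now)
    result = tool_fn(**args)
    AUDIT_LOG.append({**entry, "outcome": "ok", "result": result})
    return result
```

The design point is that the permission table and the rate limiter live outside the prompt: the model can request a tool, but only the surrounding code decides whether the call runs, and denied or throttled attempts are logged as well as successful ones.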

The Databricks production stack

For production Gen AI in financial services, the Databricks platform provides an integrated stack:

Lakeflow Connect ingests data from source systems via managed connectors
Spark Declarative Pipelines handle batch and streaming ETL transformations
Lakeflow Jobs orchestrate the end-to-end workflow (DAGs with conditional logic, triggers, monitoring)
Delta Lake stores all data with ACID transactions
Unity Catalog governs everything: data, models, functions, pipelines, vector search, serving endpoints
AI Functions apply AI directly on data in SQL or PySpark
AI Gateway + Model Serving provide governed access to models behind serving endpoints
MLflow evaluates agent and model accuracy with custom domain-specific metrics

Evaluation and iteration

Production Gen AI needs ongoing evaluation: (1) correctness (e.g. against golden sets or human review), (2) safety and policy (no leaking internal data, no harmful content), and (3) latency and cost. Start with a small labeled set and periodic human review; add automated checks (e.g. format validation, PII checks) as you scale. MLflow Agent Evaluation provides structured evaluation for agentic workflows.
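A small golden-set harness is often enough to get started. The sketch below is framework-agnostic and does not use MLflow's API; `GOLDEN_SET`, `stub_model`, and `evaluate` are hypothetical names, the questions and answers are made-up placeholders, and substring matching is a crude stand-in for a real correctness metric or human review.

```python
import json
import re
import statistics
import time

# Tiny hand-labeled golden set; contents are made-up placeholders.
GOLDEN_SET = [
    {"question": "What is the wire transfer cutoff?", "expected": "5 pm"},
    {"question": "What is the daily ATM limit?", "expected": "$500"},
    {"question": "Who approves fee waivers?", "expected": "branch manager"},
]

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # illustrative PII check only


def stub_model(question: str) -> str:
    """Hypothetical stand-in for the deployed application; returns a JSON string."""
    answers = {
        "What is the wire transfer cutoff?": "The cutoff is 5 pm local time.",
        "What is the daily ATM limit?": "The limit is $500 per day.",
        "Who approves fee waivers?": "Any teller can approve them.",
    }
    return json.dumps({"answer": answers.get(question, "unknown")})


def evaluate(model_fn, golden_set):
    """Score correctness, output format, PII leakage, and latency over the golden set."""
    rows = []
    for case in golden_set:
        start = time.perf_counter()
        raw = model_fn(case["question"])
        latency = time.perf_counter() - start
        try:
            answer = json.loads(raw)["answer"]
            valid_format = isinstance(answer, str)
        except (ValueError, KeyError, TypeError):
            answer, valid_format = "", False
        if not valid_format:
            answer = ""
        rows.append({
            # Case-insensitive substring match: a rough proxy for correctness.
            "correct": case["expected"].lower() in answer.lower(),
            "valid_format": valid_format,
            "pii_leak": bool(SSN_PATTERN.search(raw)),
            "latency_s": latency,
        })
    n = len(rows)
    return {
        "accuracy": sum(r["correct"] for r in rows) / n,
        "format_valid_rate": sum(r["valid_format"] for r in rows) / n,
        "pii_leak_rate": sum(r["pii_leak"] for r in rows) / n,
        "median_latency_s": statistics.median(r["latency_s"] for r in rows),
    }


report = evaluate(stub_model, GOLDEN_SET)
```

Running the same `evaluate` call on every prompt, model, or retrieval change gives a comparable report over time; the cases it marks incorrect are natural candidates for the periodic human review.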

Team and skills

Going from pilot to production usually requires: (1) product or business ownership of the use case and metrics, (2) engineering for APIs, pipelines, and guardrails, and (3) domain expertise (risk, compliance, operations) so guardrails and review flows match the business. You do not need a huge team; you need clear roles and at least one person who can bridge model behavior and production systems.

When to use embedded experts or partners

If you lack in-house capacity for LLM integration, MCP (Model Context Protocol), or agentic design, embedded experts (contract or staff augmentation) can own the build while your team owns requirements and rollout. The goal is to establish patterns (e.g. "how we do RAG," "how we do agentic tool use on Databricks") so that the next use case is faster and more consistent. That is how Gen AI becomes operational rather than a one-off experiment.
