Automating Data Steward Workflows: Natural Language to Semantic Types
How a leading data quality platform integrated AI-powered automation to transform months of manual steward work into instant semantic type generation
Executive Summary
Generating 7,000 custom semantic types required 28,000 hours of manual work at $150/hour. Data stewards spent months writing regex patterns, tuning confidence scores, and documenting edge cases.
Now, production-ready semantic types are generated from plain English instantly, with expert-level accuracy across telco, banking, and insurance domains.
4 years → minutes
Saves $4M+ per customer by eliminating 28,000 hours of manual work. Across the customer base, this delivers $10M+ in annual savings.
How Data Profiling Works
Imagine you have dozens of disconnected data sources, each with thousands of columns. Some columns contain names, others have credit card numbers, and some hold dates. The company's profiling engine scans across all these sources and automatically labels columns with semantic types, tags like 'First Name,' 'Email Address,' or 'SSN.' These tags aren't just labels; they power data quality checks ('Is this email formatted correctly?'), privacy rules ('Mask all SSN fields'), and compliance policies ('Encrypt credit card data').
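At its core, this kind of profiling matches sample column values against each semantic type's content pattern. A minimal sketch in Python, assuming a hypothetical registry of types (the names and regexes below are illustrative, not the platform's actual definitions):

```python
import re

# Hypothetical, simplified semantic-type registry. Patterns are
# illustrative assumptions, not the platform's real definitions.
SEMANTIC_TYPES = {
    "Email Address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ZIP Code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def profile_column(values):
    """Return the first semantic type whose content regex matches every sample value."""
    for type_name, pattern in SEMANTIC_TYPES.items():
        if all(pattern.match(v) for v in values):
            return type_name
    return None

profile_column(["alice@example.com", "bob@corp.io"])  # -> "Email Address"
```

A real engine would sample values, tolerate some mismatches, and score candidates rather than requiring a perfect match, but the shape of the problem is the same: values in, semantic tag out.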
The Problem: Creating Custom Semantic Types
The platform ships with base semantic types that work everywhere: 'Email Address,' 'Phone Number,' 'ZIP Code.' But enterprise customers need company-specific types: 'Customer ID,' 'Policy Number,' 'MAC Address,' 'Order Date.' Every company's data is different. To create a new semantic type, data stewards had to manually craft every component:
- Content Regex: Pattern matching the data itself (e.g., for 'MAC Address': match patterns like 'A1:B2:C3:D4:E5:F6')
- Header Regex Variants: Multiple patterns matching column names (e.g., 'mac_addr,' 'device_mac,' 'network_id'), each with manually assigned confidence scores
- Metadata: Human-readable names, detailed descriptions, usage examples
- Validation Rules: Edge cases, format constraints, and error handling logic
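Taken together, a hand-crafted definition might look like the following sketch for 'MAC Address'. The field names, header patterns, and confidence values are assumptions for illustration; the platform's actual schema may differ:

```python
import re

# Illustrative hand-crafted 'MAC Address' semantic type. Schema fields
# and confidence scores are assumptions, not the platform's real format.
MAC_ADDRESS_TYPE = {
    "name": "MAC Address",
    "description": "Six colon-separated hexadecimal octets, e.g. A1:B2:C3:D4:E5:F6",
    # Content regex: matches the data values themselves
    "content_regex": re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$"),
    # Header regex variants, each with a manually tuned confidence score
    "header_variants": [
        (re.compile(r"mac[_-]?addr(ess)?", re.IGNORECASE), 0.95),
        (re.compile(r"device[_-]?mac", re.IGNORECASE), 0.90),
        (re.compile(r"hw[_-]?address", re.IGNORECASE), 0.85),
    ],
}

def matches_content(value):
    """Validate a single value against the content regex."""
    return bool(MAC_ADDRESS_TYPE["content_regex"].match(value))
```

Every element here — the content regex, each header variant, each confidence score — had to be researched, written, and tested by hand, which is where the hours per type went.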
Each semantic type required hours of careful work: writing regex patterns, tuning confidence scores, documenting edge cases. For a large enterprise needing 7,000+ custom types across telco, banking, and insurance domains, this process consumed months of data steward time. At roughly $150/hour for a skilled data steward, this represented over $4 million in labor costs for a single customer, creating a massive bottleneck in data onboarding.
The Platform: AI-Powered Semantic Type Generation
The AI platform generates semantic types instantly from plain English. Data stewards simply type 'MAC Address', and the platform automatically produces:
- Content regex patterns that match MAC address formats
- Multiple header regex variants ('mac_addr,' 'device_mac,' 'hw_address') with confidence scores
- Ready-to-deploy semantic type definition in seconds
The business impact was immediate. What previously took a senior data steward 4 hours per semantic type (research, regex writing, testing, documentation) now completes in under 30 seconds. For a typical enterprise deployment needing 7,000 custom types, this eliminates 28,000 labor hours (about 14 person-years, or roughly 4 calendar years for a small steward team) and saves over $4 million in steward labor costs.
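The savings arithmetic is easy to verify. A quick sanity check (the 2,000-hour working year and the 4-steward team size are assumptions for this estimate):

```python
# Sanity-check the labor-savings figures quoted above.
# Assumptions: ~2,000 working hours per full-time year, a 4-steward team.
types_needed = 7_000      # custom semantic types for one deployment
hours_per_type = 4        # steward hours per hand-crafted type
hourly_rate = 150         # USD per skilled-steward hour

total_hours = types_needed * hours_per_type   # 28,000 hours
labor_cost = total_hours * hourly_rate        # $4,200,000
person_years = total_hours / 2_000            # 14 person-years
team_years = person_years / 4                 # ~3.5 calendar years for 4 stewards

print(total_hours, labor_cost, person_years, team_years)
# 28000 4200000 14.0 3.5
```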
The Hard Problem: Disambiguation
Here's where it gets interesting. Imagine two date columns in a dataset:
- Column A: '01/15/2024', '03/22/2024', '12/01/2023'
- Column B: '01/18/2024', '03/25/2024', '12/05/2023'
Both columns have identical content format (MM/DD/YYYY). Traditional regex can't tell them apart. But one is 'Order Date' and the other is 'Ship Date'; they have different business meanings. How do we distinguish them? The answer: header analysis.
The platform generates header-specific patterns with confidence scores:
- Column A header 'order_dt' → 95% confidence for 'Order Date' semantic type
- Column B header 'ship_date' → 92% confidence for 'Ship Date' semantic type
When content patterns tie, header confidence becomes the intelligent tie-breaker, ensuring the right semantic type gets assigned even in ambiguous cases.
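The tie-breaking logic above can be sketched in a few lines. This is a simplified illustration; the header patterns and confidence scores are assumptions, not the platform's generated output:

```python
import re

# Two semantic types whose content regexes are identical (MM/DD/YYYY),
# so only the column header can distinguish them. Patterns and scores
# are illustrative assumptions.
DATE_TYPES = [
    ("Order Date", re.compile(r"order[_-]?(date|dt)", re.IGNORECASE), 0.95),
    ("Ship Date", re.compile(r"ship[_-]?(date|dt)", re.IGNORECASE), 0.92),
]

def disambiguate(header):
    """When content patterns tie, pick the type whose header pattern
    matches with the highest confidence."""
    candidates = [(name, conf) for name, pattern, conf in DATE_TYPES
                  if pattern.search(header)]
    return max(candidates, key=lambda c: c[1]) if candidates else None

disambiguate("order_dt")   # -> ("Order Date", 0.95)
disambiguate("ship_date")  # -> ("Ship Date", 0.92)
```

Content matching narrows the field to date-shaped columns; header confidence then resolves which business meaning applies.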
Validation: Real-World Industry Data
The platform was rigorously validated against real enterprise data from telco, banking, and insurance customers. Data governance experts tested the platform across complex, domain-specific datasets, achieving expert-level accuracy on semantic types that would have taken stewards months to create manually. This accuracy was critical. Every percentage point of accuracy translated to hours saved in manual review and correction.
'Holy crap, this is going to be game-changing for us.'
— Head of Customer Success, Leading Data Quality Platform