Case Study

Automating Data Steward Workflows: Natural Language to Semantic Types

How a leading data quality platform integrated AI-powered automation to transform months of manual steward work into instant semantic type generation

Executive Summary

The Challenge
$4M+ in labor costs per customer

Generating 7,000 custom semantic types required 28,000 hours of manual work at $150/hour. Data stewards spent months writing regex patterns, tuning confidence scores, and documenting edge cases.

The Solution
AI-powered automation in 30 seconds

Production-ready semantic types generated from plain English instantly. Expert-level accuracy across telco, banking, and insurance domains.

Business Impact
$10M+ annual savings
4 years → minutes

Saves $4M+ per customer by eliminating 28,000 hours of manual work. Across the customer base, this delivers $10M+ in annual savings.

How Data Profiling Works

Imagine you have dozens of disconnected data sources, each with thousands of columns. Some columns contain names, others have credit card numbers, and some hold dates. The company's profiling engine scans across all these sources and automatically labels columns with semantic types: tags such as 'First Name,' 'Email Address,' or 'SSN.' These tags aren't just labels; they power data quality checks ('Is this email formatted correctly?'), privacy rules ('Mask all SSN fields'), and compliance policies ('Encrypt credit card data').

Profiling engine scan (sample values per column):

customer_name   email_addr    phone      ssn
John Smith      john@ex.com   555-0123   ***-**-1234
Jane Doe        jane@ex.com   555-0456   ***-**-5678
Bob Wilson      bob@ex.com    555-0789   ***-**-9012
Semantic types automatically detected and applied to each column
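
As a rough illustration of content-based labeling, the sketch below matches each column's sample values against a few content regexes and assigns the best-fitting type. The patterns, the `label_column` helper, and the 0.9 match threshold are assumptions made for this example, not the platform's actual definitions or logic.

```python
import re

# Hypothetical content patterns for three semantic types.
# These regexes are illustrative assumptions, not the platform's definitions.
CONTENT_PATTERNS = {
    "Email Address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "Phone Number": re.compile(r"^\d{3}-\d{4}$"),
    "SSN (masked)": re.compile(r"^\*{3}-\*{2}-\d{4}$"),
}


def label_column(values, min_match_ratio=0.9):
    """Return the type whose pattern matches the largest share of values,
    or None if no type reaches min_match_ratio (an assumed threshold)."""
    best_type, best_ratio = None, 0.0
    for type_name, pattern in CONTENT_PATTERNS.items():
        matches = sum(1 for value in values if pattern.fullmatch(value))
        ratio = matches / len(values) if values else 0.0
        if ratio > best_ratio:
            best_type, best_ratio = type_name, ratio
    return best_type if best_ratio >= min_match_ratio else None


# Sample values taken from the scan illustration above.
columns = {
    "customer_name": ["John Smith", "Jane Doe", "Bob Wilson"],
    "email_addr": ["john@ex.com", "jane@ex.com", "bob@ex.com"],
    "phone": ["555-0123", "555-0456", "555-0789"],
    "ssn": ["***-**-1234", "***-**-5678", "***-**-9012"],
}
labels = {name: label_column(values) for name, values in columns.items()}
```

In this sketch `customer_name` stays unlabeled because no content pattern fits free-form names; that is the kind of column where header patterns carry most of the signal.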

The Problem: Creating Custom Semantic Types

The platform ships with base semantic types that work everywhere: 'Email Address,' 'Phone Number,' 'ZIP Code.' But enterprise customers need company-specific types: 'Customer ID,' 'Policy Number,' 'MAC Address,' 'Order Date.' Every company's data is different. To create a new semantic type, data stewards had to manually craft every component:

  • Content Regex: Pattern matching the data itself (e.g., for 'MAC Address': match patterns like 'A1:B2:C3:D4:E5:F6')
  • Header Regex Variants: Multiple patterns matching column names (e.g., 'mac_addr,' 'device_mac,' 'network_id'), each with manually assigned confidence scores
  • Metadata: Human-readable names, detailed descriptions, usage examples
  • Validation Rules: Edge cases, format constraints, and error handling logic

Each semantic type required hours of careful work: writing regex patterns, tuning confidence scores, documenting edge cases. For a large enterprise needing 7,000+ custom types across telco, banking, and insurance domains, this process consumed months of data steward time. At roughly $150/hour for a skilled data steward, this represented over $4 million in labor costs for a single customer, creating a massive bottleneck in data onboarding.
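
The labor figure can be reproduced from the numbers in the executive summary (7,000 types, 28,000 hours, $150/hour). The short calculation below is only a check of that arithmetic; the variable names are chosen for this example.

```python
# Figures quoted in the executive summary.
CUSTOM_TYPES = 7_000
TOTAL_HOURS = 28_000
HOURLY_RATE_USD = 150

hours_per_type = TOTAL_HOURS / CUSTOM_TYPES          # hours of steward work per type
labor_cost_usd = TOTAL_HOURS * HOURLY_RATE_USD       # total labor cost for one customer
cost_per_type_usd = labor_cost_usd / CUSTOM_TYPES    # labor cost per semantic type

print(f"{hours_per_type:.0f} hours per type")
print(f"${labor_cost_usd:,} total labor cost")
print(f"${cost_per_type_usd:,.0f} per semantic type")
```

This works out to 4 hours and $600 of labor per semantic type, or $4.2 million for the full set of 7,000.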

semantic_type.json
{
  "name": "MAC Address",
  "contentPattern": "/^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$/",
  "headerPatterns": [
    { "pattern": "/^mac_?addr(ess)?$/i", "confidence": 0.98 },
    { "pattern": "/^hw_?address$/i", "confidence": 0.89 }
  ],
  "priority": "high"
}
Manual creation of semantic types: hours of tedious JSON configuration
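
To show how a definition like this might be consumed, the sketch below loads the MAC Address example and applies its patterns. It assumes the patterns are stored as JavaScript-style `/body/flags` literals, as in the example; the `compile_literal` and `header_confidence` helpers are hypothetical names introduced for this illustration, not part of the platform.

```python
import json
import re

# The MAC Address definition from the example, written as valid JSON.
DEFINITION_JSON = """
{
  "name": "MAC Address",
  "contentPattern": "/^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$/",
  "headerPatterns": [
    { "pattern": "/^mac_?addr(ess)?$/i", "confidence": 0.98 },
    { "pattern": "/^hw_?address$/i", "confidence": 0.89 }
  ],
  "priority": "high"
}
"""


def compile_literal(literal):
    """Turn a '/body/flags' literal into a compiled Python regex.
    Only the 'i' (case-insensitive) flag is handled in this sketch."""
    body, _, flags = literal[1:].rpartition("/")
    return re.compile(body, re.IGNORECASE if "i" in flags else 0)


definition = json.loads(DEFINITION_JSON)
content_re = compile_literal(definition["contentPattern"])
header_res = [
    (compile_literal(entry["pattern"]), entry["confidence"])
    for entry in definition["headerPatterns"]
]


def header_confidence(header):
    """Return the highest confidence among matching header patterns, else 0.0."""
    return max((conf for pat, conf in header_res if pat.search(header)), default=0.0)


is_mac = content_re.search("A1:B2:C3:D4:E5:F6") is not None
is_truncated_mac = content_re.search("A1:B2:C3") is not None
```

A header such as `hw_address` matches only the second pattern, so `header_confidence` returns its 0.89 score, while a header that matches neither pattern scores 0.0.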

The Platform: AI-Powered Semantic Type Generation

The AI platform generates semantic types instantly from plain English. Data stewards simply type 'MAC Address', and the platform automatically produces:

  • Content regex patterns that match MAC address formats
  • Multiple header regex variants ('mac_addr,' 'device_mac,' 'hw_address') with confidence scores
  • Ready-to-deploy semantic type definition in seconds
Input (natural language): MAC Address → AI Engine → Output (semantic type definition)
Natural language instantly converted to complete semantic type definition

The business impact was immediate. Work that previously took a senior data steward 4 hours per semantic type (research, regex writing, testing, documentation) now takes under 30 seconds. For a typical enterprise deployment needing 7,000 custom types, this eliminated 28,000 labor hours (roughly 4 years of full-time work) and saved over $4 million in steward labor costs at $150/hour.

The Hard Problem: Disambiguation

Here's where it gets interesting. Imagine two date columns in a dataset:

  • Column A: '01/15/2024', '03/22/2024', '12/01/2023'
  • Column B: '01/18/2024', '03/25/2024', '12/05/2023'

Both columns have identical content format (MM/DD/YYYY), so a content regex alone can't tell them apart. But one is 'Order Date' and the other is 'Ship Date'; they have different business meanings. How do we distinguish them? The answer: header analysis.

The platform generates header-specific patterns with confidence scores:

  • Column A header 'order_dt' → 95% confidence for 'Order Date' semantic type
  • Column B header 'ship_date' → 92% confidence for 'Ship Date' semantic type

When content patterns tie, header confidence becomes the intelligent tie-breaker, ensuring the right semantic type gets assigned even in ambiguous cases.
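
A minimal sketch of this tie-breaking idea follows, assuming candidates are ranked by content match first and header confidence second. The `SemanticType` class, the header regexes, and the ranking rule are illustrative assumptions; only the 0.95 and 0.92 confidence values come from the example above.

```python
import re
from dataclasses import dataclass


@dataclass
class SemanticType:
    name: str
    content_pattern: re.Pattern
    header_patterns: list  # list of (compiled regex, confidence) pairs


DATE_RE = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # MM/DD/YYYY shape only

# Header regexes here are hypothetical; confidences follow the example above.
ORDER_DATE = SemanticType(
    "Order Date", DATE_RE, [(re.compile(r"^order_(dt|date)$", re.I), 0.95)]
)
SHIP_DATE = SemanticType(
    "Ship Date", DATE_RE, [(re.compile(r"^ship_(dt|date)$", re.I), 0.92)]
)


def content_score(sem_type, values):
    """Fraction of values matching the type's content pattern."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if sem_type.content_pattern.match(v))
    return hits / len(values)


def header_score(sem_type, header):
    """Highest confidence among the type's matching header patterns, else 0.0."""
    return max(
        (conf for pat, conf in sem_type.header_patterns if pat.match(header)),
        default=0.0,
    )


def assign_type(header, values, candidates):
    """Rank by content score, then use header confidence to break ties."""
    best = max(
        candidates,
        key=lambda t: (content_score(t, values), header_score(t, header)),
    )
    return best.name if content_score(best, values) > 0 else None


candidates = [ORDER_DATE, SHIP_DATE]
column_a = assign_type("order_dt", ["01/15/2024", "03/22/2024", "12/01/2023"], candidates)
column_b = assign_type("ship_date", ["01/18/2024", "03/25/2024", "12/05/2023"], candidates)
```

Both candidates score 1.0 on content for either column, so the header confidence alone decides the outcome. A fuller version would also need a rule for headers that match no candidate, which this sketch does not handle.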

Identical content format (MM/DD/YYYY):

txn_date     settle_dt
01/15/2024   01/18/2024
03/22/2024   03/25/2024
12/01/2023   12/05/2023
Header analysis resolves ambiguity → Correct types assigned

Validation: Real-World Industry Data

The platform was rigorously validated against real enterprise data from telco, banking, and insurance customers. Data governance experts tested the platform across complex, domain-specific datasets, achieving expert-level accuracy on semantic types that would have taken stewards months to create manually. This accuracy was critical. Every percentage point of accuracy translated to hours saved in manual review and correction.

"Holy crap, this is going to be game changing for us."

Head of Customer Success, Leading Data Quality Platform
UniversalAGI - AI-Powered Data Intelligence Platform