Automating Data Steward Workflows: Natural Language to Semantic Types
How a leading data quality platform integrated AI-powered automation to transform months of manual steward work into instant semantic type generation
Executive Summary
Generating 7,000 custom semantic types required 28,000 hours of manual work at $150/hour. Data stewards spent months writing regex patterns, tuning confidence scores, and documenting edge cases.
Now, production-ready semantic types are generated from plain English instantly, with expert-level accuracy across telco, banking, and insurance domains.
4 years → minutes
Saves $4M+ per customer by eliminating 28,000 hours of manual work. Across the customer base, this delivers $10M+ in annual savings.
How Data Profiling Works
Imagine you have dozens of disconnected data sources, each with thousands of columns. Some columns contain names, others have credit card numbers, and some hold dates. The company's profiling engine scans across all these sources and automatically labels columns with semantic types, tags like 'First Name,' 'Email Address,' or 'SSN.' These tags aren't just labels; they power data quality checks ('Is this email formatted correctly?'), privacy rules ('Mask all SSN fields'), and compliance policies ('Encrypt credit card data').
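At its core, this kind of profiling matches sample column values against each semantic type's content pattern. A minimal sketch in Python, assuming a hypothetical registry of types (the names and regexes below are illustrative, not the platform's actual definitions):

```python
import re

# Hypothetical, simplified semantic-type registry. Patterns are
# illustrative assumptions, not the platform's real definitions.
SEMANTIC_TYPES = {
    "Email Address": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "ZIP Code": re.compile(r"^\d{5}(-\d{4})?$"),
}

def profile_column(values):
    """Return the first semantic type whose content regex matches every sample value."""
    for type_name, pattern in SEMANTIC_TYPES.items():
        if all(pattern.match(v) for v in values):
            return type_name
    return None

profile_column(["alice@example.com", "bob@corp.io"])  # -> "Email Address"
```

A real engine would sample values, tolerate some mismatches, and score candidates rather than requiring a perfect match, but the shape of the problem is the same: values in, semantic tag out.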
The Problem: Creating Custom Semantic Types
The platform ships with base semantic types that work everywhere: 'Email Address,' 'Phone Number,' 'ZIP Code.' But enterprise customers need company-specific types: 'Customer ID,' 'Policy Number,' 'MAC Address,' 'Order Date.' Every company's data is different. To create a new semantic type, data stewards had to manually craft every component:
- Content Regex: Pattern matching the data itself (e.g., for 'MAC Address': match patterns like 'A1:B2:C3:D4:E5:F6')
- Header Regex Variants: Multiple patterns matching column names (e.g., 'mac_addr,' 'device_mac,' 'network_id'), each with manually assigned confidence scores
- Metadata: Human-readable names, detailed descriptions, usage examples
- Validation Rules: Edge cases, format constraints, and error handling logic
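Taken together, a hand-crafted definition might look like the following sketch for 'MAC Address'. The field names, header patterns, and confidence values are assumptions for illustration; the platform's actual schema may differ:

```python
import re

# Illustrative hand-crafted 'MAC Address' semantic type. Schema fields
# and confidence scores are assumptions, not the platform's real format.
MAC_ADDRESS_TYPE = {
    "name": "MAC Address",
    "description": "Six colon-separated hexadecimal octets, e.g. A1:B2:C3:D4:E5:F6",
    # Content regex: matches the data values themselves
    "content_regex": re.compile(r"^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$"),
    # Header regex variants, each with a manually tuned confidence score
    "header_variants": [
        (re.compile(r"mac[_-]?addr(ess)?", re.IGNORECASE), 0.95),
        (re.compile(r"device[_-]?mac", re.IGNORECASE), 0.90),
        (re.compile(r"hw[_-]?address", re.IGNORECASE), 0.85),
    ],
}

def matches_content(value):
    """Validate a single value against the content regex."""
    return bool(MAC_ADDRESS_TYPE["content_regex"].match(value))
```

Every element here — the content regex, each header variant, each confidence score — had to be researched, written, and tested by hand, which is where the hours per type went.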
Each semantic type required hours of careful work: writing regex patterns, tuning confidence scores, documenting edge cases. For a large enterprise needing 7,000+ custom types across telco, banking, and insurance domains, this process consumed months of data steward time. At roughly $150/hour for a skilled data steward, this represented over $4 million in labor costs for a single customer, creating a massive bottleneck in data onboarding.
The Platform: AI-Powered Semantic Type Generation
The AI platform generates semantic types instantly from plain English. Data stewards simply type 'MAC Address', and the platform automatically produces:
- Content regex patterns that match MAC address formats
- Multiple header regex variants ('mac_addr,' 'device_mac,' 'hw_address') with confidence scores
- Ready-to-deploy semantic type definition in seconds
The business impact was immediate. What previously took a senior data steward 4 hours per semantic type (research, regex writing, testing, documentation) now completes in under 30 seconds. For a typical enterprise deployment needing 7,000 custom types, this eliminates 28,000 labor hours (about 14 person-years, or roughly 4 calendar years for a small steward team) and saves over $4 million in steward labor costs.
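The savings arithmetic is easy to verify. A quick sanity check (the 2,000-hour working year and the 4-steward team size are assumptions for this estimate):

```python
# Sanity-check the labor-savings figures quoted above.
# Assumptions: ~2,000 working hours per full-time year, a 4-steward team.
types_needed = 7_000      # custom semantic types for one deployment
hours_per_type = 4        # steward hours per hand-crafted type
hourly_rate = 150         # USD per skilled-steward hour

total_hours = types_needed * hours_per_type   # 28,000 hours
labor_cost = total_hours * hourly_rate        # $4,200,000
person_years = total_hours / 2_000            # 14 person-years
team_years = person_years / 4                 # ~3.5 calendar years for 4 stewards

print(total_hours, labor_cost, person_years, team_years)
# 28000 4200000 14.0 3.5
```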
The Hard Problem: Disambiguation
Here's where it gets interesting. Imagine two date columns in a dataset:
- Column A: '01/15/2024', '03/22/2024', '12/01/2023'
- Column B: '01/18/2024', '03/25/2024', '12/05/2023'
Both columns have identical content format (MM/DD/YYYY). Traditional regex can't tell them apart. But one is 'Order Date' and the other is 'Ship Date'; they have different business meanings. How do we distinguish them? The answer: header analysis.
The platform generates header-specific patterns with confidence scores:
- Column A header 'order_dt' → 95% confidence for 'Order Date' semantic type
- Column B header 'ship_date' → 92% confidence for 'Ship Date' semantic type
When content patterns tie, header confidence becomes the intelligent tie-breaker, ensuring the right semantic type gets assigned even in ambiguous cases.
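The tie-breaking logic above can be sketched in a few lines. This is a simplified illustration; the header patterns and confidence scores are assumptions, not the platform's generated output:

```python
import re

# Two semantic types whose content regexes are identical (MM/DD/YYYY),
# so only the column header can distinguish them. Patterns and scores
# are illustrative assumptions.
DATE_TYPES = [
    ("Order Date", re.compile(r"order[_-]?(date|dt)", re.IGNORECASE), 0.95),
    ("Ship Date", re.compile(r"ship[_-]?(date|dt)", re.IGNORECASE), 0.92),
]

def disambiguate(header):
    """When content patterns tie, pick the type whose header pattern
    matches with the highest confidence."""
    candidates = [(name, conf) for name, pattern, conf in DATE_TYPES
                  if pattern.search(header)]
    return max(candidates, key=lambda c: c[1]) if candidates else None

disambiguate("order_dt")   # -> ("Order Date", 0.95)
disambiguate("ship_date")  # -> ("Ship Date", 0.92)
```

Content matching narrows the field to date-shaped columns; header confidence then resolves which business meaning applies.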
Validation: Real-World Industry Data
The platform was rigorously validated against real enterprise data from telco, banking, and insurance customers. Data governance experts tested the platform across complex, domain-specific datasets, achieving expert-level accuracy on semantic types that would have taken stewards months to create manually. This accuracy was critical. Every percentage point of accuracy translated to hours saved in manual review and correction.
'Holy crap, this is going to be game-changing for us.'
— Head of Customer Success, Leading Data Quality Platform