VLDB 2025: AI Meets Enterprise Data Management — The Tabular FM Moment
Snapshot from London (Sep 1–5, 2025).
1,100+ attendees from 40+ countries, 1,600+ submissions, and a ~20% acceptance rate. VLDB is a flagship venue for the data management community, right alongside SIGMOD.
I attended to participate in an invited panel on Neural Relational Data, discussing the need for bespoke foundation models for structured data.
The Big Picture: AI-for-DM and DM-for-AI
Two complementary shifts are gathering pace:
- AI-for-DM — LLMs and agents do data management (data engineering, schema matching, discovery, evaluation).
- DM-for-AI — Databases run AI better (new operators, efficient vector support, and data architectures that serve agentic workloads).
The insightful and actionable keynote by Matei Zaharia (CTO and Co-Founder of Databricks) touched on both: databases must evolve for agentic, speculative access patterns [1] and for portable data products that retain semantics across stacks [5].
Spotlight: Foundation Models on Tabular Data (TFMs)
If you want to catch the next wave of GenAI early, the most important trend is the rapid maturation of foundation models for relational/tabular data. VLDB 2025 hosted a panel on “Neural Relational Data: Tabular Foundation Models, LLMs… or both?”, bringing together TFM and LLM camps, with emphasis on multi-table learning and semantics. I had the opportunity to participate as an invited speaker. [9]

Why this matters
Most business value lives in databases: ERP, CRM, finance, supply chain, telemetry. Traditional ML has long excelled here, but foundation models promise:
- Generalization across schemas and domains,
- Few-shot performance on new tasks,
- Re-usable representations that downstream apps (forecasting, optimization, anomaly detection) can tap into.
What changed in 2025
- From single tables to relational context. Models are learning across multiple linked tables and capturing business semantics rather than isolated columns.
- This will broaden to Semantically Linked Tables (SLT): LLMs can translate questions to SQL, but predictive and prescriptive tasks need context beyond one or even multiple tables and their schemas: declarative and procedural business knowledge as well as external signals. [8]
- Knowledge as structure, not just text. Graph-style linking of entities, processes, and external knowledge provides explicit grounding, traceability, and provenance—crucial for regulated industries in an in-context learning setting.
- Synthetic data (e.g., synthetic time-series corpora [6]) is accelerating training and evaluation without exposing sensitive datasets.
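To make the synthetic-corpus idea concrete, here is a minimal, self-contained sketch of generating labeled synthetic time series (trend + seasonality + noise, with injected anomalies). All function and parameter names are illustrative assumptions, not the actual pipeline behind ChatTS [6]:

```python
import math
import random

def synth_series(n=96, trend=0.05, season_period=24, season_amp=2.0,
                 noise=0.3, anomaly_prob=0.02, seed=0):
    """Generate one labeled series: trend + seasonality + Gaussian noise,
    with occasional injected spikes (labels mark the anomalies)."""
    rng = random.Random(seed)
    values, labels = [], []
    for t in range(n):
        v = trend * t + season_amp * math.sin(2 * math.pi * t / season_period)
        v += rng.gauss(0, noise)
        is_anomaly = rng.random() < anomaly_prob
        if is_anomaly:
            v += rng.choice([-1, 1]) * 5 * season_amp  # inject a spike
        values.append(v)
        labels.append(int(is_anomaly))
    return values, labels

# Build a small corpus of (series, labels) pairs for training/evaluation,
# with no sensitive production data involved.
corpus = [synth_series(seed=s) for s in range(100)]
```

Because generation is seeded, the same corpus can be regenerated for reproducible evaluation, which is part of the appeal over sampling from production systems.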
What this enables
- Better demand forecasting, risk scoring, next-best-action, fraud/outage detection, and scenario planning—with less feature plumbing and faster time-to-value.
- Retrieval-augmented analytics over structured data: models “pull” the right tables, rows, and business rules on demand.
- Explainable decisions via grounded joins and lineage, not opaque text-only chains.
What Enterprises Should Do Now
- Lay the semantic foundation. Invest in a linked semantic layer (business objects, relationships, business rules) that bridges operational code, data catalogs, and external knowledge—think Semantically Linked Tables and grounded FM training [8].
- Tame schema drift and data integration with AI. Use retriever → LLM reranker pipelines (SLM+LLM) for scalable, explainable schema alignment across apps and extensions [2].
- Use synthetic data strategically. Generate task-specific corpora (SQL, time-series, logs) for safe training/eval; keep real-world validation loops [3], [6].
- Close the loop on governance and make data accessible for training. Track provenance, time-travel, and branches to enable fine-tuning of foundation models on relational data.
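The retriever → reranker pattern above can be sketched in a few lines. This is a toy stand-in, not Magneto's implementation [2]: cheap string similarity plays the role of the embedding retriever, and a token-overlap score plays the role of the LLM reranker (which in practice would also see types and sample values):

```python
from difflib import SequenceMatcher

def retrieve(source_col, target_cols, k=3):
    """Stage 1: cheap retriever — string similarity stands in for an
    embedding model; returns the top-k candidate target columns."""
    return sorted(target_cols,
                  key=lambda t: SequenceMatcher(None, source_col.lower(),
                                                t.lower()).ratio(),
                  reverse=True)[:k]

def rerank(source_col, candidates):
    """Stage 2: placeholder for an LLM reranker; here we simply prefer
    the candidate sharing the most name tokens with the source column."""
    s_tokens = set(source_col.lower().split("_"))
    return max(candidates,
               key=lambda t: len(s_tokens & set(t.lower().split("_"))))

source = ["cust_id", "order_ts", "total_amount"]
target = ["customer_id", "order_timestamp", "amount_total", "region"]
matches = {s: rerank(s, retrieve(s, target)) for s in source}
```

The two-stage shape is the point: the retriever keeps cost linear in schema size, while the (expensive) reranker only ever sees a handful of candidates, which is what makes the approach scale across apps and extensions.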
Adjacent Trends and Highlights from VLDB 2025 (In Brief)
- Make data portable and self-describing. Favor open formats/protocols and “self-decoding” datasets so products move across platforms without losing meaning [4], [5].
- Prepare for agentic traffic. Expect bursty, speculative probes from AI agents. Add caching, shaping, and accuracy/latency knobs at the data layer, plus interfaces, processing paths, and agent memory stores designed for autonomous workflows [1].
- Lakehouse → “Lakebase” patterns: OLTP+OLAP unification and app/agent runtime on shared object-store data (with copy-on-write for safe experimentation and agentic development workflows).
- Data source discovery: Learned indexes that unify indexing and search to help agents find the right tables fast [7].
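The copy-on-write branching behind these Lakebase-style workflows can be illustrated with a toy in-memory sketch (class and method names are my own, not any product's API): reads fall through to the parent snapshot, writes land in a local overlay, so an agent can experiment on a branch without touching the main data:

```python
class Branch:
    """Toy copy-on-write branch: reads fall through to the parent,
    writes go to a local overlay, so parent snapshots stay immutable."""
    def __init__(self, parent=None, base=None):
        self.parent = parent
        self.base = dict(base) if base is not None else {}
        self.overlay = {}  # local writes only (copy-on-write)

    def get(self, key):
        if key in self.overlay:
            return self.overlay[key]
        if key in self.base:
            return self.base[key]
        return self.parent.get(key) if self.parent else None

    def put(self, key, value):
        self.overlay[key] = value

    def branch(self):
        return Branch(parent=self)

main = Branch(base={"orders": 120, "revenue": 9800})
experiment = main.branch()        # agent gets an isolated branch
experiment.put("revenue", 11200)  # speculative write, main is untouched
```

Branching is O(1) because nothing is copied up front; only modified entries ever occupy new space, which is what makes speculative, agent-driven experimentation cheap on shared object-store data.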
Bottom Line
The next wave of enterprise AI won’t be just better LLMs and agents. It will be foundation models on relational data, including semantically linked tables, trained and served on agent-aware, portable, semantically rich data systems—so your models learn from the right context and your decisions stand up to scrutiny.
Further Reading
[1] Liu et al. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First, https://arxiv.org/abs/2509.00997
[2] Liu et al. Magneto: Combining Small and Large Language Models for Schema Matching, https://doi.org/10.14778/3742728.3742757
[3] Schmidt et al. SQLStorm: Taking Database Benchmarking into the LLM Era, https://doi.org/10.14778/3749646.3749683
[4] Gienieczko et al. AnyBlox: A Framework for Self-Decoding Datasets, https://doi.org/10.14778/3749646.3749672
[5] Puttaswamy et al. Delta Sharing: An Open Protocol for Cross-Platform Data Sharing, https://www.vldb.org/pvldb/vol18/p5197-puttaswamy.pdf
[6] Xie et al. ChatTS: Aligning Time Series with LLMs via Synthetic Data, https://arxiv.org/abs/2412.03104
[7] Guo et al. BIRDIE: Natural Language-Driven Table Discovery Using Differentiable Search Index, https://doi.org/10.14778/3734839.3734845
[8] Klein & Hoffart. Foundation Models for Tabular Data within Systemic Contexts Need Grounding (FMSLT), https://arxiv.org/abs/2505.19825
[9] Papotti & Binnig (Panel Chairs). Panel: Neural Relational Data—Tabular FMs, LLMs… or both?, https://www.vldb.org/pvldb/vol18/p5513-paolo.pdf