VLDB 2025: AI Meets Enterprise Data Management — The Tabular FM Moment

Snapshot from London (Sep 1–5, 2025).
1,100+ attendees from 40+ countries, 1,600+ submissions, and roughly 20% acceptance. VLDB is a flagship venue for the data management community, right alongside SIGMOD.

I attended to participate in an invited panel on Neural Relational Data, discussing the need for bespoke foundation models for structured data.

The Big Picture: AI-for-DM and DM-for-AI

Two complementary shifts are gathering pace:

  • AI-for-DM — LLMs and agents do data management (data engineering, schema matching, discovery, evaluation).
  • DM-for-AI — Databases run AI better (new operators, efficient vector support, and data architectures that serve agentic workloads).

The insightful and actionable keynote by Matei Zaharia (CTO and Co-Founder of Databricks) touched on both: databases must evolve for agentic, speculative access patterns [1] and for portable data products that retain semantics across stacks [5].

Spotlight: Foundation Models on Tabular Data (TFMs)

If you want to catch the next wave of GenAI early, the most important trend is the rapid maturation of foundation models for relational/tabular data. VLDB 2025 hosted a panel on “Neural Relational Data: Tabular Foundation Models, LLMs… or both?”, bringing together TFM and LLM camps, with emphasis on multi-table learning and semantics. I had the opportunity to participate as an invited speaker. [9]

Panel Discussion on Neural Relational Data

Why this matters

Most business value lives in databases: ERP, CRM, finance, supply chain, telemetry. Traditional ML has long excelled here, but foundation models promise:

  • Generalization across schemas and domains,
  • Few-shot performance on new tasks,
  • Re-usable representations that downstream apps (forecasting, optimization, anomaly detection) can tap into.

What changed in 2025

  • From single tables to relational context. Models are learning across multiple linked tables and capturing business semantics rather than isolated columns.
  • This will broaden to Semantically Linked Tables (SLT): LLMs can translate questions to SQL, but predictive and prescriptive tasks need context beyond one or even multiple tables and their schemas: declarative and procedural business knowledge as well as external signals. [8]
  • Knowledge as structure, not just text. Graph-style linking of entities, processes, and external knowledge provides explicit grounding, traceability, and provenance, which is crucial for regulated industries in an in-context learning setting.
  • Synthetic data (e.g., synthetic time-series corpora [6]) is accelerating training and evaluation without exposing sensitive datasets.
Semantically Linked Tables – what’s in it? From: Klein & Hoffart. Foundation Models for Tabular Data within Systemic Contexts Need Grounding. arxiv.org/abs/2505.19825
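To make the synthetic-data point concrete, here is a minimal sketch of how a labeled synthetic time-series corpus can be built (illustrative only; this is not the corpus construction used by ChatTS [6]): compose trend, seasonality, and noise, and keep the generating parameters as labels so models can be trained and evaluated without touching sensitive production data.

```python
import math
import random

def make_series(n: int = 168, trend: float = 0.05, amplitude: float = 3.0,
                period: int = 24, noise: float = 0.5, seed: int = 0):
    """Generate one synthetic series plus the ground-truth labels it was built from."""
    rng = random.Random(seed)
    values = [trend * t                                       # linear trend
              + amplitude * math.sin(2 * math.pi * t / period)  # daily seasonality
              + rng.gauss(0, noise)                           # observation noise
              for t in range(n)]
    labels = {"trend": trend, "amplitude": amplitude, "period": period}
    return values, labels

series, labels = make_series()
```

Because the labels are known by construction, the same generator doubles as an evaluation harness: a model's estimate of trend or period can be scored against ground truth.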

What this enables

  • Better demand forecasting, risk scoring, next-best-action, fraud/outage detection, and scenario planning—with less feature plumbing and faster time-to-value.
  • Retrieval-augmented analytics over structured data: models “pull” the right tables, rows, and business rules on demand.
  • Explainable decisions via grounded joins and lineage, not opaque text-only chains.
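The retrieval step above can be sketched in a few lines. This toy example (catalog, table names, and columns are all hypothetical, and real systems would use learned embeddings rather than bag-of-words) ranks candidate tables by cosine similarity between a natural-language question and each table's schema description:

```python
from collections import Counter
import math

# Hypothetical catalog: table name -> flattened schema description.
CATALOG = {
    "orders": "order id customer id order date total amount currency",
    "customers": "customer id name segment country signup date",
    "shipments": "shipment id order id carrier ship date delivery date",
}

def bow(text: str) -> Counter:
    """Bag-of-words vector as a token-count Counter."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tables(question: str, k: int = 2) -> list[str]:
    """Return the k catalog tables most similar to the question."""
    q = bow(question)
    ranked = sorted(CATALOG, key=lambda t: cosine(q, bow(CATALOG[t])), reverse=True)
    return ranked[:k]

print(retrieve_tables("total order amount per customer segment"))
# → ['orders', 'customers']
```

The retrieved tables (and, in a fuller system, relevant rows and business rules) would then be passed to an LLM as grounded context for SQL generation or analysis.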

What Enterprises Should Do Now

  1. Lay the semantic foundation. Invest in a linked semantic layer (business objects, relationships, business rules) that bridges operational code, data catalogs, and external knowledge—think Semantically Linked Tables and grounded FM training [8].
  2. Tame schema drift and data integration with AI. Use retriever → LLM reranker pipelines (SLM+LLM) for scalable, explainable schema alignment across apps and extensions [2].
  3. Use synthetic data strategically. Generate task-specific corpora (SQL, time-series, logs) for safe training/eval; keep real-world validation loops [3], [6].
  4. Close the loop on governance and make data accessible for training. Track provenance, time-travel, and branches to enable fine-tuning of foundation models on relational data.
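Step 2's retriever → reranker pattern can be sketched as follows (a simplified stand-in, not the actual Magneto pipeline from [2]; all column names are hypothetical). A cheap retriever shortlists candidate target columns, then a reranker reorders the shortlist; here the reranker is a token-overlap heuristic standing in for an LLM:

```python
import difflib

TARGETS = ["customer_name", "customer_id", "order_total", "contact_phone"]

def retrieve(source: str, targets: list[str], k: int = 3) -> list[str]:
    # Stage 1: cheap candidate generation via fuzzy string similarity.
    return sorted(targets,
                  key=lambda t: difflib.SequenceMatcher(None, source, t).ratio(),
                  reverse=True)[:k]

def rerank(source: str, candidates: list[str]) -> list[str]:
    # Stage 2: stand-in for an LLM reranker; prefers shared whole tokens.
    src_tokens = set(source.replace("_", " ").split())
    def overlap(t: str) -> int:
        return len(src_tokens & set(t.replace("_", " ").split()))
    return sorted(candidates, key=overlap, reverse=True)

best = rerank("cust_name", retrieve("cust_name", TARGETS))[0]
```

The two-stage split is what makes this scale: the cheap retriever prunes thousands of columns so the expensive (LLM) reranker only sees a handful, and the shortlist itself documents why a match was proposed.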

Adjacent Trends and Highlights from VLDB 2025 (In Brief)

  • Make data portable and self-describing. Favor open formats/protocols and “self-decoding” datasets so products move across platforms without losing meaning [4], [5].
  • Prepare for agentic traffic. Expect bursty, speculative probes from AI agents. Add caching, shaping, and accuracy/latency knobs at the data layer, plus interfaces, processing paths, and agent memory stores designed for autonomous workflows [1].
  • Lakehouse → “Lakebase” patterns: OLTP+OLAP unification and app/agent runtime on shared object-store data (with copy-on-write for safe experimentation and agentic development workflows).
  • Data source discovery: Learned indexes that unify indexing and search to help agents find the right tables fast [7].

Bottom Line

The next wave of enterprise AI won’t just be better LLMs and agents. It will be foundation models on relational data, including semantically linked tables, trained on agent-aware, portable, semantically rich data systems, so your models learn from the right context and your decisions stand up to scrutiny.

Further Reading

[1] Liu et al. Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First, https://arxiv.org/abs/2509.00997
[2] Liu et al. Magneto: Combining Small and Large Language Models for Schema Matching, https://doi.org/10.14778/3742728.3742757
[3] Schmidt et al. SQLStorm: Taking Database Benchmarking into the LLM Era, https://doi.org/10.14778/3749646.3749683
[4] Gienieczko et al. AnyBlox: A Framework for Self-Decoding Datasets, https://doi.org/10.14778/3749646.3749672
[5] Puttaswamy et al. Delta Sharing: An Open Protocol for Cross-Platform Data Sharing, https://www.vldb.org/pvldb/vol18/p5197-puttaswamy.pdf
[6] Xie et al. ChatTS: Aligning Time Series with LLMs via Synthetic Data, https://arxiv.org/abs/2412.03104
[7] Guo et al. BIRDIE: Natural Language-Driven Table Discovery Using Differentiable Search Index, https://doi.org/10.14778/3734839.3734845
[8] Klein & Hoffart. Foundation Models for Tabular Data within Systemic Contexts Need Grounding (FMSLT), https://arxiv.org/abs/2505.19825
[9] Papotti & Binnig (Panel Chairs). Panel: Neural Relational Data—Tabular FMs, LLMs… or both?, https://www.vldb.org/pvldb/vol18/p5513-paolo.pdf