For a few years now, we have seen the emergence of a concept called the Semantic Layer. It's not a fresh new idea, but it resonates with the current state of analytics maturity we're reaching.
But while the abstract idea of a semantic layer is easy enough to define, it's still hard to navigate the syntax and the implementation of this layer.
Why do we need a semantic layer in the first place?
We have customer relationship management (CRM), we have project management, we have HR, and we have finance software. The core value of these tools is to act as interfaces. These are the layers where faces get to interact with each other. Inter-faces.
But what's the interface for data?
The obvious answer is the database - especially, these days, the data warehouse. It comes with role-based access control (RBAC) and teams (data engineers) managing what goes in and out. Hence, it's the main interface for moving forward on the data-to-insight motion.
However, this interface is kinda flawed. First because its main syntax was never designed for this role, but also because the faces who need to interact with data are many, and come from very different backgrounds.
Out of these realities comes what we nowadays call the Semantic Layer:
A centralized control panel for company's data – a semantic layer manages data modeling, data access, and sends consistent data and metrics to every BI software, data app, and AI/LLM tool.
This is indeed a layer, an interface, that allows any user to access and interpret data using familiar terms, in a consistent way, without needing deep technical (data) knowledge. The best-suited use cases for a semantic layer involve centralized metrics definition, unified data modeling, query optimization with a cache layer, and security and access control governance [1].
As I'm writing these lines in 2025, the current state of semantic layers is not that glamorous: nowadays, a semantic layer is basically a team of data engineers and data analysts who play the ETL game and get ad-hoc analyses done.
Don't get me wrong: like CRM, HR, finance, product, etc., there will always be ad-hoc things and Excel spreadsheets. But like our cousins in those functions, we will eventually have proper software to structure and guide our interactions. Basically, shifted left.
SQL is not designed for analytics
We have the chance to know our semantics quite well. At least we have proper (sometimes endless) discussions about it. It's how we define ARR, what's a user, what's the user journey, etc. All of these elements are different from company to company, but ultimately we are able to sit down and put the equation on paper.
The next question: how do we move from paper to proper sheets?
Here, the database and its main syntax, SQL, is the first thing that comes to mind.
SQL has actually shaped how we think about data. It's all about rectangular pieces. But that brought us limitations.
SQL is not a functional programming language. It's hard to build queries on top of queries. Put another way: it's hard to build up our semantics. There's no polymorphism, no functions. It's hard to factor our work.
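To make this concrete, here's a minimal sketch (the events and users tables and their columns are invented for illustration, and date syntax varies by engine). The definition of an "active user" can't be packaged as a reusable function, so it gets copy-pasted - and eventually drifts - from one query to the next:

```sql
-- Minimal sketch, assuming a hypothetical `events` table (user_id, event_type, event_at)
-- and a `users` table (user_id, plan). The "active user" rule cannot be packaged as a
-- reusable function, so it is copy-pasted wherever it is needed.
SELECT COUNT(DISTINCT user_id) AS weekly_active_users
FROM events
WHERE event_type IN ('login', 'purchase', 'comment')          -- the "active" definition
  AND event_at >= CURRENT_DATE - INTERVAL '7' DAY;

-- The same definition, duplicated (and free to drift) in the next question we ask.
SELECT u.plan, COUNT(DISTINCT e.user_id) AS weekly_active_users
FROM events e
JOIN users u ON u.user_id = e.user_id
WHERE e.event_type IN ('login', 'purchase', 'comment')        -- copy-pasted again
  AND e.event_at >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY u.plan;
```

Views and CTEs help, but they don't compose the way functions do: the moment the definition needs a parameter or a variation, we're back to copy-paste.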
Syntax often trails behind semantics: our world is moving fast, and so are our ideas and needs. Our syntax is more than 30 years old. It was designed for OLTP applications and relational models. But business models aren't relational models.
To bridge this natural gap we have built tools, frameworks, and platforms.
And it works. It's now common to see medallion architectures built on top of dbt and cloud platforms all connected to BI tools and data applications.
This propels a kind of data-to-insight-to-action motion. That makes us, somehow, data-driven.
But in the end, we have yet to upgrade our steering wheel. Managing all the data work through hundreds of lines of SQL, templated with Jinja to patch the lack of functional capabilities, keeps us from moving to the next step [2].
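Concretely, this is roughly what the patch looks like in a dbt-style model - a hypothetical sketch, not anyone's production code. A Jinja macro (an invented `is_active()` here) stands in for the function SQL doesn't have, and a for-loop stamps out the columns SQL can't generate by itself:

```sql
-- Hypothetical dbt-style model. `is_active()` would be a Jinja macro defined once
-- under macros/ and returning a 1/0 expression; Jinja fills in for the functions SQL lacks.
SELECT
    user_id,
    MAX({{ is_active('event_type') }}) AS is_active,
    {% for channel in ['web', 'ios', 'android'] %}
    SUM(CASE WHEN channel = '{{ channel }}' THEN revenue ELSE 0 END) AS revenue_{{ channel }}
    {%- if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('stg_events') }}
GROUP BY user_id
```

It works, and dbt makes it maintainable, but the semantics now live half in SQL and half in a templating language that knows nothing about them.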
At some point, we will need new syntaxes. Leaders in that space have agreed for years already:
I have to admit, I didn't think about dbt and Malloy working together... And all of that makes sense in the longer term. I think it'll be really interesting to see how far Jinja gets stretched by what people do with metrics. It feels like entire mini-languages could get built within it, to the point where a lot of what you're describing—an imperative language that compiles to SQL—almost falls out by accident. [3]
But it's not easy. It's actually "lots and LOTS of work to do".
The LLM Fallacy
Nowadays we expect to interact with software primarily in natural language. That's lazy. I mean, good engineers are lazy. But they are lazy in a good way, because they have interfaces that let them explore the problem and solution spaces in a variety of ways. The interface offers nudges. Guidelines.
The bottleneck is in the data interpretation, the code comprehension. The bottleneck is always the problem at hand; it's never the boilerplate, never the patterns. Our expertise isn't in knowing how to write a case statement in SQL, or anything else you could get an LLM to write - it's in knowing how to deal with the specificities and intricacies of the problem we're working on. The work just happens to take the form of composing code.
We could fall into the trap of thinking that natural language is the ultimate interface for the computer. But in the end, the code isn't for the computer - it's for us. It's for humans. If code were meant for the computer, we would all be writing pure binary instead of these abstracted, symbolic languages.
The code is the interface we designed to be able to program the computer. It's what we need. It's objective, explicit, unambiguous, (relatively) static, internally consistent, and robust. English has none of these properties: it's subjective, its meaning is often implicit and ambiguous, it's always changing, contradictions appear, and its structure does not hold up to analysis.
And so, the text-to-SQL motion lacks an essential value: semantic composition. Yes, LLMs are one of the best partners to probe our semantics. It's like discussing with a human: you bring context sentence after sentence, question after question. Good prompt engineering can be key here to getting good probing questions and moving forward with our meaning-making of the data.
When we create a well-defined and bounded semantic layer, we can provide LLMs with context. While LLMs don't truly understand semantics, they're similar to an intern who would be overwhelmed if given database schemas and asked to perform detailed analysis without proper context. By working within this structured semantic layer, it becomes much easier to combine our internal knowledge, business relationships, and desired insights in a meaningful way. We can probe and build up the semantics in natural language, as we would do naturally in the meetings and at the coffee machine.
But natural language comes with ambiguity and context dependency, which makes precise composition a challenge.
That's the main reason we write code. We want to write as little as possible, but still write it, because it brings composition and a set of routines hopefully close to our semantics.
Interface & Composition
I like to think of this whole problem space as a problem of interfaces and composition.
The first interface here is humans - natural language. The second is the database - SQL.
These two interfaces are very, very good. While it's hard to build up our semantics with natural language, it's still very useful for understanding and probing them. Especially with LLMs and good context. There's a world - not so far away - where you can talk to the best analyst and expert in your day-to-day language.
Same for SQL: it's probably the best language we have created to interact with the database. However, we probably went too far with it. It was originally designed for OLTP applications decades ago. There are no functions in it. We have brought in tools around it to fix that (templating languages like Jinja, frameworks, etc.). But dbt is jQuery, not React. It's ultimately tough to implement the semantics we want with it.
And so comes the semantic layer: an intermediary interface allowing us to compose our semantics.
Here the technology space is in bloom.
The leader here is probably Cube. It comes with a declarative DSL that lets us expose semantics to any common interface (dashboards, apps, etc.).
It ships with caching, pre-aggregations, and access control. It also brings some polymorphism and real composability when well designed, but I agree with Carl that YAML (or any markup language) is probably not the way forward here. At least not only.
Just like programming, data modeling and exploration is a very cognitively demanding process, and cognitively demanding processes benefit tremendously from fast feedback loops. At the Malloy team, we're trying as hard as we can to ensure that data modelers and analysts can achieve "flow state" while doing their work. This means doing hard things, and sweating every tiny detail that causes friction in the experience.
From my experience, I can say that the YAML pitfalls don't really apply when it sits behind a proper user interface. Also, Cube's approach looks similar to Terraform or Kubernetes deployments in that it's easy to adapt, fix, and maintain without getting "in the flow". It's much more of an exposition layer.
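Coming back to the pre-aggregation point above, here's a generic illustration of what it buys you - not Cube's actual internals, just the idea, with the same invented tables as before. The layer keeps a small rollup up to date and serves dashboard queries from it instead of rescanning raw events:

```sql
-- Generic illustration of a pre-aggregation: a daily rollup maintained by the
-- semantic layer (table and column names are invented for the example).
CREATE TABLE daily_revenue_rollup AS
SELECT
    CAST(event_at AS DATE)       AS event_date,
    channel,
    SUM(revenue)                 AS revenue,
    COUNT(DISTINCT user_id)      AS users
FROM events
GROUP BY CAST(event_at AS DATE), channel;

-- A dashboard asking "revenue by channel over the last 7 days" is answered from
-- the rollup, not from the raw events table.
SELECT channel, SUM(revenue) AS revenue
FROM daily_revenue_rollup
WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY channel;
```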
On the other side we have new languages like Malloy. It changes the way we look at data and brings us much closer to the semantics we want to build up. It comes with routines and a defined syntax that just fits naturally. Just look at a few lines of it.
It gets us in the flow state and allows us to compose our semantics with ease.
For now, Malloy doesn't make it easy to expose these semantics. BI tools expect rectangular inputs (tables), or a SQL API to call (like Cube's - see the sketch below).
However, it does come with a charting library.
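As for the SQL API route mentioned above, here's a sketch of what consuming a semantic layer that way can look like (based on my reading of Cube's SQL API, so double-check the docs). The BI tool keeps sending ordinary-looking SQL, and the layer resolves the measure - its formula, filters, and joins - before the query ever reaches the warehouse:

```sql
-- Sketch of a query against a semantic layer's SQL API (Cube-style MEASURE()):
-- `orders` is a cube and `total_revenue` a measure defined in the layer, not a raw column.
SELECT
    status,
    MEASURE(total_revenue) AS total_revenue
FROM orders
GROUP BY status;
```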
Declarative BI is a thing now, and there is probably a space where the final interface for the insight-to-action motion is not a dashboard built by dragging and dropping things around, but natural language bound to a semantic layer that knows (or lets you choose) the best way to represent the data. It could be a chart, a pivot table, a text, a slide deck, a voice memo, or an Excel file.
Recently, I have also been amazed by how well dbt and Lightdash integrate. Using this duo with a bit of AI to fill in dbt documentation, you can build a full semantic layer within dbt while taking advantage of pre-built aggregations in Lightdash. It's still SQL interlaced with crappy Jinja. But at least everything is as code, and it's a realistic and pragmatic solution, especially now that dbt is a standard. I think it's a good first step toward implementing a semantic layer (plus, as a bonus, Lightdash has a metric tree builder - more a toy than anything, but still, we are making progress).
What's the syntax for the semantic layer then?
To me, it's YAML for exposing and maintaining the glue for downstream applications, and a new language that allows us to build those semantics upstream. The latter is probably built on top of SQL, as it's the best database interface we have.
Depending on the maturity of the business and the level of technicality, it could be a bounded chat application, YAML only with Cube, Malloy for analysts who like to cook, a mix of both - and sometimes it will be something else entirely. Don't worry, we will have a buyer's guide.
As always, the syntax will trail behind our semantics. We should stay flexible and be ready to throw it in the trash. That is sometimes the best way to move as fast as possible toward our ultimate goal: shortening the data-to-insight-to-action motion.
[1] https://cube.dev/blog/universal-semantic-layer-capabilities-integrations-and-enterprise-benefits
[2] SQL is great. We think we don't really need more than that - like engineers who were dealing with the AWS console before using Terraform. That's maybe the wrong call; try Malloy and you will understand (FWIW, Malloy doesn't solve all our problems, it simply gives an idea of what things could be).
[3] https://benn.substack.com/p/has-sql-gone-too-far/comment/6072291
[4] I love the in-depth review. Beyond the syntax, the challenge is also generating all of the knowledge for an AI/LLM to consume.