Why isn't there a free Reforge for data analysis?
Why isn't there a free Reforge for the data stack?
If one exists, please point it out to me.
Yes, each company has a different set of data and contexts that make one analysis hard to reproduce elsewhere.
Building a cohort analysis, a bar chart with time on the x-axis and a metric on the y-axis, or a pie chart that nobody will look at[1]: all of these share common ground.
The core issue is two-fold: we build our charts without reusability in mind, and we build them with tools that make them hard to reproduce.
"BI as Code" isn't a new trend out of thin air. It's again the realization that code is a necessary need to build up on our work. To do reusable work.
Yes, it's most of the time harder than drag and drop in the first place. But recent advances in tools, languages, and support, make our stall spaces starting to move forward again.
I'm craving the common-ground, "first go-to" analyses every company is doing. Can we automate them? Make templates out of them?
If the inputs are an activity schema model or a classic dim & fact architecture, could I write Malloy code, a Cube specification, or a Rill declarative dashboard that would fit many cases?
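To make it concrete, here is a minimal sketch of what such a template could look like: a cohort-retention query parameterized in plain Python. The fact table and its user_id / event_date columns are hypothetical, and I'm assuming the BigQuery SQL dialect.

```python
# A minimal sketch of a reusable analysis template: a parameterized
# cohort-retention query over an assumed fact table (BigQuery dialect).
from string import Template

COHORT_SQL = Template("""
WITH cohorts AS (
    SELECT $user_id AS user_id,
           DATE_TRUNC(MIN($event_date), MONTH) AS cohort_month
    FROM $fact_table
    GROUP BY user_id
)
SELECT c.cohort_month,
       DATE_DIFF(DATE_TRUNC(e.$event_date, MONTH), c.cohort_month, MONTH) AS month_offset,
       COUNT(DISTINCT e.$user_id) AS active_users
FROM $fact_table e
JOIN cohorts c ON c.user_id = e.$user_id
GROUP BY 1, 2
ORDER BY 1, 2
""")

def cohort_query(fact_table: str, user_id: str = "user_id",
                 event_date: str = "event_date") -> str:
    """Render the same cohort analysis for any dim & fact model."""
    return COHORT_SQL.substitute(fact_table=fact_table,
                                 user_id=user_id, event_date=event_date)

print(cohort_query("analytics.events"))
```

Swap the inputs and the same analysis runs anywhere.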
Where is this library of analysis reusability and the community that goes with it?
I'm exploring such a possibility and launching an MVP. Feel free to contact me if this means anything to you!
📡 Expected Contents
Lessons Learned Implementing Metric Trees
I recently had a nice call with a Head of Data: he was struggling with shadow data practices and tons of data to expose.
It's hard to prioritize the next dashboards, the next north star metric, the next SQL queries.
This nice interview traces the implementation of a metric tree in a startup. As with the activity schema, it's great to see these new data frameworks coming to life in real companies.
I can't wait to read similar feedback[2].
The Iceberg Scam?
I recently wanted to deep dive into S3 Tables and how to use DuckDB as the query engine there.
I quickly stopped: as per the screenshot below and this great post, it's only possible to access these S3 Tables with Apache Spark as the open-source query engine.
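For the record, here is the kind of thing I was hoping to do. A sketch only: it uses DuckDB's iceberg extension against a placeholder S3 path, which works for a plain Iceberg table on S3 but not for tables managed through the S3 Tables catalog (credentials setup omitted).

```python
# Querying an Iceberg table straight from DuckDB: the dream scenario.
# Assumes a plain Iceberg table at a placeholder S3 path, not the
# S3 Tables catalog; S3 credentials configuration is omitted.
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

con.sql("""
    SELECT COUNT(*) AS row_count
    FROM iceberg_scan('s3://my-bucket/warehouse/events')
""").show()
```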
Yes, it's not related to Apache Iceberg per se. Still, from discussions with some peers in the field, it sounds like Databricks or Snowflake are the only "serious" (not too painful) ways to go. But then I feel the promises of an "open" table format and cost reduction aren't fulfilled.
Again, it's a very new field, but my main point here is: why do we care so much right now?
I mean, we have great data warehouses, and they're not that expensive. Especially knowing most of us don't have big data, and the promise is to create value out of the data, not to make cost reduction the main focus.
To me, it sounds like a very engineering-driven subject, trendy because the macro-economic climate is all about cost reduction and efficiency. But having our engineers focus on implementing a lakehouse while our data models are still crappy and hard to maintain sounds like a suboptimal path to me. Premature, even.
I might be wrong and too opinionated on this. What's your view?
Sounds like our busyness hasn't been that focused on the business lately.
dbt, sdf, sqlmesh, quary
We're seeing the SQL-for-analytics world converging. dbt acquired SDF. SQLMesh acquired Quary.
I recently built a full data model using only BigQuery views and corresponding Terraform declarations. Don't get me wrong, I started by using SQLMesh. But in the end, these tools are only necessary when you have a strong team of analysts and you want to enforce some best practices. In my case, I was alone, and without tons of data sources to gather.
It's OK to schedule queries for partition creation, to create views of views, to template queries in plain Python. To simply get things done, without creating a monster of dependencies.
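To illustrate, here is a minimal sketch of that views-of-views approach in plain Python, using the google-cloud-bigquery client. The project, dataset, and queries are placeholders.

```python
# Templating "views of views" in plain Python, no framework involved.
# Project, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Each entry defines one view; later views build on earlier ones.
VIEWS = {
    "my_project.staging.orders":
        "SELECT * FROM `my_project.raw.orders` WHERE status != 'draft'",
    "my_project.marts.daily_orders":
        """SELECT DATE(created_at) AS day, COUNT(*) AS orders
           FROM `my_project.staging.orders` GROUP BY day""",
}

for view_id, sql in VIEWS.items():
    view = bigquery.Table(view_id)
    view.view_query = sql
    client.create_table(view, exists_ok=True)  # no-op if it already exists
    print(f"ensured view {view_id}")
```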
These frameworks are mostly glue for templating SQL, because we lack a proper lingua franca for the stuff we want to do.
Something I learned in the past 5 years: don't use a framework because it's trendy. Because it's what everyone is doing.
Use it when you know your incentives. When you explicitly need a frame to scale your work. That's the goal of a framework.
About Super-Intelligence
I still don't know how deep I should dive into the AI subject[3]. I sometimes think about all these engineers, all these prompts, all these new models, etc. As we are probably approaching the end of the "early adopter" curve, it might be time, isn't it?
If, like me, you're looking for a great resource to get the big picture, I can't recommend Leopold Aschenbrenner's blog enough. He is ex-OpenAI and has written a huge series of posts on AI and the superintelligence singularity.
Take a big cup of coffee before pulling these threads.
📰 The Blog Post
Last month I wrote a more realistic take on my "SQL is not designed for analytics" views. I still stand by this opinion, but to move forward we need small steps. We need to be pragmatic. That's what I'm exploring in this blog post, with BigFunctions as one solution in this complex equation.
Beyond SQL as a Pure Database Syntax
Reminder: code is technical debt. The optimal spot for engineers is to write the least code possible but still write code.
🎨 Beyond The Bracket
It's amazing how quickly Midjourney blog header images went from a "cool way to stand out" to a "sign of low-effort content".
Long-time readers know how much I care about aesthetics, especially photography, and about my profound connection to identity.
I craft most of this newsletter and blog post manually, without AI intervention. I remain an optimist and technology enthusiast, so I primarily use AI as an editorial assistant. It performs admirably, allowing me to preserve my voice while benefiting from precise proofreading and grammatical checks.
In the same vein, my most compelling reads come from authentic writers. My favorites are those with unmistakable, distinctive styles. Pieces that create an immediate sense of intimacy and comfort. As if we're long-time friends sharing an unspoken understanding.
In an era where AI usage will become as commonplace as using Google Docs, I might take advantage of this new paradigm. Yet, ultimately, the core of my writing—the essence of my voice—resides in my daily experiences, in spontaneous conversations, in those crumpled notebook pages, and scattered iPhone notes. Not in the sterile output of a generic AI prompt interface.
At work, I've been doing similar things. These past weeks, I focused on building a data model. I used AI a bit: to clean up my database schema and check my metrics. The real work was connecting ideas, testing assumptions, and designing solutions. The final result matches a specific case that would be hard to create through simple prompting.
I feel like I got a life upgrade recently. Tough work. Heart loaded. It feels like my meaning-making delta isn't in sync with my current version of self.
Yet, I know that everything will ultimately click together. As usual, in the months to come.
The writing I'm putting out right now is probably the last bit of a past me.
And so, the reading I'm discovering right now is probably the first bit of a new me.
See you in March ☀️
[1] Because pie charts are usually not a good idea. Please stop using them.
[2] I did a bunch of data modeling recently and will probably come back with my experience feedback in a dedicated post. Please add a comment if you've had any experience with data modeling implementation recently!
[3] I'm also exploring the Model Context Protocol these days.
We've been using DuckDB in production for a while now, and have tested several approaches.
For now, the most (cost-)efficient way we have found is to stream parquet files from an NFS-based host.
The object storage model is way too expensive since you're paying on a per-query basis. An NFS file server makes it much easier, especially on the streaming side.
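As a rough sketch of that setup, assuming a hypothetical NFS mount point and parquet layout:

```python
# DuckDB streaming parquet files from an NFS mount instead of object
# storage: no per-request billing, just plain file reads.
# The mount point, layout, and event_date column are assumptions.
import duckdb

con = duckdb.connect()
con.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('/mnt/nfs/warehouse/events/*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").show()
```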