Did my slide deck come across as rubbish? Did I talk too fast? Too technical? Why did no one ask a question?
These were the questions I asked myself after presenting dbt to all of the company's data teams.
In this blog post, I'll share the backstory of my failure to convince more than 30 people to move forward and migrate our data transformations to dbt.
In light of my recent experiences and new perspective, I'll review what I could have done better, and why a mix of product and sales education is a keystone for any engineer wanting to improve themselves and the company they work for.
Five Data Teams, Apache Spark, and a Cloud Migration Walk Into a Bar…
I started thinking about using dbt four months after joining the company.
I was working as a data engineer in a team of more than 20 analysts and a few data scientists, with the mission of providing analytics to every domain of the company.
I had used dbt for over a year in my previous job, and I felt it could be a great fit for the challenges we were facing:
Migration to the cloud. The company was on the verge of finishing the migration from a full on-premise Hadoop stack to the Google Cloud Platform.
5 data teams. While my team was the biggest in headcount, the other teams worked on things like recommendation systems, royalty computations, data infrastructure, CRM, etc. They all reported to different C-level executives and had different levels of maturity, so navigating the organization was sometimes tough.
From Spark to SQL. Originally, Spark handled most of the heavy lifting for data transformations. However, the migration to Google Cloud Platform (GCP) and current job market trends made SQL queries a more attractive option, especially for data analysts.
My role in this system was two-fold:
Support the data analysts by bringing in software best practices, improving scalability, and advancing data modeling techniques.
Participate in machine learning modeling and help the data science team deploy models behind customer-facing APIs on Kubernetes.
For those who like job titles: it was a mix between the work of an analytics engineer and a machine learning engineer.
In this broader context, it was clear to me that a framework such as dbt would help a lot, especially for the data analysts who weren't used to Spark's syntax.
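To make the pitch concrete: a dbt model is just a SQL SELECT statement saved in a file, which is exactly why it appeals to analysts. A minimal sketch (the model and column names here are made up for illustration):

```sql
-- models/marts/daily_orders.sql: an illustrative dbt model.
-- dbt materializes the SELECT as a table or view, and ref()
-- declares the dependency on another model, building the DAG for you.
{{ config(materialized='table') }}

select
    order_date,
    count(*)        as nb_orders,
    sum(amount_eur) as revenue_eur
from {{ ref('stg_orders') }}  -- upstream staging model
group by order_date
```

No Spark session, no imperative code: an analyst who knows SQL can own the whole transformation.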
Error 1: Thinking Everyone Was OK with SQL
In all of my previous experiences, SQL was the go-to language for anything related to analytics.
I had worked a lot with BigQuery and AWS Athena, so I was biased toward the declarative syntax and transparent compute these platforms provide.
By contrast, many of the teams were using Spark for heavy computations, training machine learning models, and so on.
With Hadoop (Cloudera) as the main infrastructure, they had no other choice: data volumes were driven by a B2C business, so jobs easily processed more than 1 TB of data.
Most of the data engineers and data scientists were used to Spark but, quite surprisingly to me, not to SQL.
I had assumed that anyone who knew Spark would also know SQL. That's not the case.
This was the first prior I failed to account for in my presentation: I should have explicitly shown that SQL was gaining ground for more and more queries in the company, driven by the GCP migration and the growing use of BigQuery.
Error 2: Thinking Another Migration Was Acceptable
Related to my first error, I assumed that migrating from Spark to SQL would make sense and wouldn't be too painful.
One thing I overlooked was that Spark was doing a great job in our situation. Using SQL to loop over nested fields is usually painful. Yes, modeling data properly in a medallion-like architecture helps, but it also demands more governance and maturity than what was in place back then.
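To illustrate the friction: in BigQuery, each level of a nested structure needs its own UNNEST, so queries over deeply nested records get verbose fast. A sketch against a hypothetical schema (the table and field names are made up):

```sql
-- Illustrative BigQuery query over nested data: orders with a
-- repeated `items` field. One UNNEST per nesting level; with two
-- or three levels of nesting, this is where Spark code felt more
-- natural to the teams than the SQL equivalent.
select
    o.order_id,
    item.product_id,
    item.quantity
from `project.dataset.orders` as o,
    unnest(o.items) as item
where item.quantity > 1
```

The query is perfectly doable, but multiplied across every nested field and every model, it was a real migration cost I had underestimated.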