The Data Team Equation: Balancing Staffing with Provisioning
When to pay for a tool or not? Staffing or provisioning?
We borrow best practices from software engineering. But data and classic software are only cousins, not twins.
Data is less mature. It involves fewer people. It's much more complex, as data is literally everywhere. We're only at a liminal stage here.
Still, we don’t invest that much to move forward. I mean, we invest a lot in the team and projects but not in the tooling.
While we invest in cloud providers to support our need for computing and storage, we rely a lot on open-source projects for all our operations.
Why? How much real money should we invest in tooling? What should we look at?
In the following sections, I will discuss some potential answers and provide my perspective, drawing from my experience as a data engineer in B2C organizations and my time behind the scenes at a company that develops open-source software.
The Case of Accountants
Accountancy is probably the most nominal job in any company. Accountants have to release sharp figures, exact to the last decimal.
Some still use Microsoft Excel to do the math, but that's the "zero" level.
As they mature, they need professional software. And it costs a lot of dollars for one big reason: it handles critical data and procurement operations.
If the software doesn't work on March 30th, closing Q1 turns into a panic. If the data is inaccurate, the market could respond with a significant decline.
Paying for top-notch tools here isn’t a debate. It’s what makes the work great and professional.
Without it, accountancy would be hard, prone to error, and hard to govern.
Doesn’t that remind you of something?
The Open-Source Fallacy
Data engineers — and the data industry generally — rely a lot on open-source software.
Actually, the first layer of any data architecture relies on proprietary infrastructure, the so-called "cloud provider". Think Google Cloud Platform, Amazon Web Services, or Microsoft Azure.
But the second layer — the one that operates on top of these cloud services — is mostly built with open-source tools.
The most used databases (Postgres), the most used interfaces (SQL and Python), and the most used orchestrators (Apache Airflow): all the core components that make data flow are open-source software.
Open-source software sounds awesome, right?
Great communities. Great product.
Easy to tinker with, and open to play around.
Plus, it’s free.
Who wants to pay a fortune when the project's success is shaky? When the return on investment is hard to monitor?
But here's the catch: open-source tools rarely come with built-in control, security, or the ability to handle big-time use. Setting them up in a local environment might be a breeze, but making them work for a whole company with all the bells and whistles (think security, uptime, and control) is a whole different monster.
That's usually where companies building open-source projects find a monetization strategy. Selling security, governance, and scalability sounds like a good deal for both parties.
But what we often do instead of paying for enterprise or cloud versions is build teams of engineers.
Don’t get me wrong: the build vs buy question is still an important one. Probably the most important one.
Moreover, open source allows us to align on a kind of standard and to stay away from closed-source lock-in and monolithic legacy systems.
But don’t you feel that we’re unbalanced here?
How much do we pay for professional tools to make our pipelines risk-free and professional?
Moreover, data engineering teams cost a lot. Just look at the latest salary survey. A senior data engineer costs more than $100,000 a year, fully loaded.
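To make the trade-off concrete, here is a back-of-the-envelope sketch, not a real pricing model. It takes the $100,000 figure above as the yearly salary. The share of an engineer's time spent maintaining a self-hosted tool (half) and the license price ($30,000) are hypothetical numbers chosen only for illustration, and the function names are made up for this example.

```python
# Lower bound quoted in the text, in USD per year.
LOADED_SALARY = 100_000


def self_hosting_cost(fte_share: float, loaded_salary: float = LOADED_SALARY) -> float:
    """Yearly cost of the engineering time spent running a self-hosted tool.

    fte_share is the fraction of one full-time engineer absorbed by
    setup, upgrades, security, and on-call for that tool (an assumption).
    """
    return fte_share * loaded_salary


def cheaper_to_buy(fte_share: float, license_cost: float,
                   loaded_salary: float = LOADED_SALARY) -> bool:
    """True when the paid version costs less than the engineering time it replaces."""
    return license_cost < self_hosting_cost(fte_share, loaded_salary)


# Hypothetical scenario: half an engineer maintains the tool,
# and the managed version costs $30,000 a year.
print(self_hosting_cost(0.5))       # 50000.0
print(cheaper_to_buy(0.5, 30_000))  # True
```

The comparison ignores everything that is hard to price (lock-in, flexibility, the learning a team gets from running its own stack), so it is a starting point for the discussion rather than an answer.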
It's worth noting that the last layer in analytics, the dashboard, shows a bit more balance. Vendors such as Tableau (Salesforce), Power BI, or Qlik account for a big part of the usage, while open-source alternatives (Metabase, Superset) are comparatively new projects.
Is it because the end data is more critical? Because business stakeholders use it?
Things are changing, and data is getting serious. It’s becoming critical for the business. We’re not just a couple of folks playing with numbers anymore.
To create true institutional trust around the work that data teams do, we might need to do what accountants do: Give our work some rules.
We are putting the rules in place (through data mesh, SOMA standards, and other data governance policies). And just like accountants wouldn't rely on personal spreadsheets forever, we gotta step up our tools.
We might need to buy into the software that makes our lives easier, saves us cash, and gets the job done way better.
When to Pay? The Real Power of Open-Source
Why do we use open source? Because it’s free? Because it’s morally pleasant?
Why don't we buy more of the great software made by expert developers and backed by great support?
Beyond the potential pricing blocker, what's the underlying reason?