We often get questions like “What’s the difference between a BI engineer and a data engineer?” or “Is the analytics engineer the new BI engineer?”
Those questions are definitely relevant. Beyond the buzzwords, they capture the same feeling: data, as a field, isn’t really new. The words have changed, but the substance behind them is much the same as ten years ago…
Back to Business Intelligence
Let’s step back 20 years.
Companies are already using data to make decisions. The paradigm already has a name: Business Intelligence. The main roles regarding data are Database Administrator, Reporting Analyst, and Business Intelligence Manager.
At this time, the data scientist is more of a PhD researcher, and the data engineer doesn’t really exist.
Business Intelligence people deliver insights and rely on proper tools to do so. The end user doesn’t play with data directly; they ask the BI team for reports.
No self-service, no Big Data, no neural network, no semantic layer, no cloud provider, no machine learning, etc.
So what has changed?
Everything: business plans, economies, compute costs, etc. But the real shift is “Big Data”.
Long story short, we live in a world creating way more data than 20 years ago. The velocity, volume, variety, and veracity of data are now huge.
This shift brought new opportunities, but it also created new problems that couldn’t be solved with the tools of the Business Intelligence era.
Tools were very restrictive. Compute capacity couldn’t handle billions of rows arriving daily. Data processing couldn’t be done at high scale.
As a corollary, people were trained to analyze tabular data and create basic charts, not to deal with raw images, handle terabytes of unstructured data, ingest real-time events, or run predictive analysis…
All that tooling quickly became obsolete. And to a certain extent, so did BI people.
Migration to data engineering
Then a new era appeared. Data became an even more engineering-heavy field than it already was. Hence the data engineer.
There were several driving elements behind the shift toward cloud computing and the democratization of big data:
Hardware capability increased dramatically.
Storage and compute costs decreased (making it possible to store and experiment with data that wasn’t business-critical at the time, but would be key for innovation).
Storage and compute could scale horizontally while the hardware investment stayed roughly linear (simply put: two 8-CPU machines are cheaper than one 16-CPU machine).
Networking improved with higher bandwidth and better technology — allowing easier data transfer on many nodes.
Those improvements were led by researchers and inside growing startups like Google, Yahoo!, Amazon, and Facebook. All of those companies faced the same challenge of indexing and storing Internet data at scale.
Quite naturally, some of them saw a disruptive opportunity:
“Hey, we have a lot of computers here! What if we leveraged this computing power to provide cloud-based services to businesses and individuals?!”
Cloud providers were born, and several frameworks and programming languages were created, making it possible for businesses of all sizes to access powerful computing resources at a fraction of the cost of building and maintaining their own infrastructure.
🖇️ For more details on the whole history behind “Big Data”, read this great article
What we often miss in this new world is how data engineering is essentially a means to keep Business Intelligence practices up to date with the shift toward “big data”.
We build pipelines with new frameworks and programming languages. BI did too, with Talend, Pentaho, Datapine, MicroStrategy, etc.
We build dashboards with self-service tools. BI made dashboards too with SAP Business Objects, SAS Business Intelligence, PowerBI, etc.
We schedule jobs. BI did too with Cron, Rundeck, Control-M, and other similar tools.
Data engineers are to a certain extent a replacement for traditional BI tools.
When we look closely at those BI tools, we may wonder why we don’t use them much nowadays. Why aren’t they part of the “Modern Data Stack”, when they still have great features and bundle common services?
For example, many BI tools offer features like data governance, data quality management, and data lineage tracking: critical components of any modern data stack.
While data engineers have developed new tools and techniques to extract insights from big data, it’s essential to remember that many of the core features of BI tools are still relevant today and should not be overlooked.
The reason traditional BI tools are not as commonly used today is that they lack the scalability and the user experience required to handle the amount of data organizations now need to process. With data arriving faster than ever before, traditional BI tools simply were not designed for these constraints.
Hence data engineers, who are able to process any kind of data, build large-scale data warehouses, and provision self-service data tools.
The emergence of a “modern data stack” is a direct response to the challenges faced by data engineers. Most new tools in the data industry now exist to enhance and automate the work of data engineers in some way.
While the technical solutions for managing big data have been available for some time, companies have been hesitant to adopt full automation due to the perception that it was cheaper to hire humans to develop and maintain projects rather than to invest in machines or start automation projects.
However, with the demand for skilled data engineers increasing and salaries rising in recent years, companies are now seeking new tools and technologies to fill this gap and help automate their data engineering processes, putting money where there is more value.
Driving greater value
It’s interesting to note that the data industry often lags behind the software world by 5 to 10 years.
We data folks are starting to catch up though, and we’re adopting some of the same tools and techniques that software engineers have been using for years.
We end up writing Dockerfiles, setting up CI/CD pipelines, and using cloud services to host data-intensive applications. If that’s not DevOps, what is it?
The data mesh is strongly inspired by software teams. New roles, such as machine learning engineer or analytics engineer, are modeled on software needs.
Just as the front-end world moved from jQuery to React, and operations shifted from bash scripts to tools like Terraform and Kubernetes, the data industry is also moving towards a more declarative paradigm.
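To make the imperative-to-declarative shift concrete, here is a minimal sketch in plain Python (a toy illustration, not the API of any specific tool): the imperative version spells out how to aggregate step by step, while the declarative version only states what result is wanted and leaves the execution to a tiny engine, which is the same principle SQL, Terraform, or dbt apply at a much larger scale.

```python
from collections import defaultdict

orders = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]

# Imperative: we describe *how* to compute, step by step.
totals_imperative = {}
for order in orders:
    customer = order["customer"]
    totals_imperative[customer] = totals_imperative.get(customer, 0) + order["amount"]

# Declarative: we describe *what* we want as a spec (a hypothetical
# mini-DSL), and a small engine decides how to execute it.
spec = {"group_by": "customer", "sum": "amount"}

def run(spec, rows):
    """Execute a group-by/sum spec over a list of dict rows."""
    acc = defaultdict(int)
    for row in rows:
        acc[row[spec["group_by"]]] += row[spec["sum"]]
    return dict(acc)

totals_declarative = run(spec, orders)
print(totals_declarative)  # {'a': 17, 'b': 5}
```

The payoff of the declarative form is that the spec is just data: it can be versioned, validated, and optimized by the engine without touching user code.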