Book review: Fundamentals of Data Engineering

Skeletor reads FODE

If you’re immersed in the data world, then over the last few years it’s been impossible to escape the book ‘Fundamentals of Data Engineering’ by Joe Reis and Matt Housley. And even if you haven’t come across the book, its writers have become hard to avoid via various other media.

Although Matt Housley is perhaps a little less conspicuous, Joe Reis seems to be everywhere, be it podcasts, conferences or blogs. I’m a subscriber to both Joe’s Substack and the Practical Data Modelling Substack he also runs, and I often find myself nodding along sagely to his writings - he speaks a lot of sense!

But I'm no Data Engineer!

But it’s taken me until now to actually read the book. How come? I actually ordered it in June last year, but so far it’s been an ornament on my bookshelf, a hot title to be seen in the background during an umpteenth video call. In fairness, I always planned to read it, but I’ve done a LOT of reading in the last 12 months or so and this just happened to be a little bit down the pile. It finally got to the top…
 

I have previously talked about the fact that I don’t see myself as a Data Engineer. Historically I was a jack-of-all-trades BI developer who looked after the full data lifecycle, but as data disciplines have matured and niched down to more specific responsibilities, I’ve found myself more in the analytics space. My strengths lie in collaborating with stakeholders to understand their needs, modelling data to help answer their questions, and serving it via a means they can interact with (that could be a semantic layer or dashboards and reports).

Building platforms and pipelines is not really my bag, especially as the world has moved in a more code heavy direction.

But I’m also a completist. And I really believe that a good data professional should have a more holistic view of the world, understanding, at least in a rudimentary fashion, the upstream and downstream dependencies for whichever specialism they may choose.

The book 

I’m pleased to report that the book itself also shares this holistic approach. If you think this book is JUST about Data Engineering, you’re doing it a disservice.

The main premise for the content is the idea of the data engineering lifecycle. There are five main focuses of this lifecycle: Generation, Storage, Ingestion, Transformation and Serving. Each of these areas has six common undercurrents:

  • Security

  • Data Management

  • DataOps

  • Data Architecture

  • Orchestration

  • Software Engineering

If I speak to most data engineers today, I think they would recognise the Ingestion and Transformation parts of this lifecycle as their bread and butter, but thinking about data from a cradle-to-grave perspective and inspecting the full breadth of its use prompts you to go deeper than this, and really gives a more complete picture of the engineering discipline. Part one of the book gives an overview of data engineering building blocks and sets context for part two, which contains a chapter for each of the data engineering lifecycle stages.

Generation

The Generation stage is really about source systems. Whilst data engineers are not necessarily responsible for the building and configuration of their data sources, this was a great journey through the types of data they may expect to work with alongside the pitfalls folks might encounter.
 

Storage

This underpins all stages of the data engineering lifecycle. In an age where we see a proliferation of new storage approaches, this was a welcome overview. Some parts perhaps go a bit too far into the weeds. Do I really need to know about the physics behind magnetic disk head movement and rotation? Perhaps not, but I guess the completist theme continues with this level of depth. The chapter visits the raw ingredients of data storage (hence the HDD lesson), then moves on to storage systems (e.g. RDBMS and Object Storage) as well as the abstractions that sit on top of these systems (e.g. Data Warehouses and Data Lakes). Some of the detail may seem unnecessary, but I feel better informed for having read it.
 

Ingestion

This covers how we get data into a platform, transporting it from source to storage, and all the things you may need to consider. There is a lot here covering the two main approaches to ingestion: Batch versus Streaming. I think there is an acknowledgement that batch is still the most suitable approach for most use cases, but there’s a plethora of great information about the considerations you need to make when it comes to streaming too.
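
To make the batch versus streaming distinction a bit more concrete, here is a minimal Python sketch of my own. It is not taken from the book, and the event data and function names are hypothetical, made up purely for illustration. The batch function waits for the full set of records and processes them in one go, whereas the streaming function updates its state and emits a fresh result as each event arrives.

```python
from collections import defaultdict

# Hypothetical events (customer, amount); illustrative only, not from the book.
EVENTS = [("ann", 10), ("bob", 5), ("ann", 7)]


def batch_totals(events):
    """Batch: wait for the whole set to land, then process it in one pass."""
    totals = defaultdict(int)
    for customer, amount in events:
        totals[customer] += amount
    return dict(totals)


def streaming_totals(events):
    """Streaming: update running state per event and emit a snapshot each time."""
    totals = defaultdict(int)
    for customer, amount in events:
        totals[customer] += amount
        # A copy is yielded so earlier snapshots are not mutated later.
        yield dict(totals)
```

Both approaches land on the same final totals. What differs is when a result becomes available, which is really the trade-off at the heart of the batch versus streaming decision.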
 

Transformation

For someone who personally sees themselves as a bit of a data engineering heathen, I found it refreshing and somewhat reassuring that this section focusses heavily on SQL as the language of choice for data transformation. SQL nerds rejoice! We’re still relevant!

This section also covers one of my favourite topics, data modelling. You need to have a plan for what shape to transform that data into, right? Popular approaches such as Inmon, Kimball, Data Vault and OBT all get a mention here, with a pragmatic tone that does not necessarily espouse one over the other.

Serving

 

Ah, my baby… This is probably the area I feel most at home in. Again, although data engineers might not always be responsible for this stage of the lifecycle, they are very much responsible for providing the inputs to it, and so context about this is really important for a good engineer.

I especially enjoyed the line about making data products being a full-contact sport. Here serving is broken into three main categories: Analytics, Machine Learning and Reverse ETL. Analytics gets a further breakdown, differentiating between business analytics and operational analytics, including a soundbite I’m sure to recycle many times:

Operational analytics versus business analytics = immediate action versus actionable insights

I also enjoyed the fact there is a call out that “reverse ETL” is a godawful name for the practice, whilst earlier in the book they had also acknowledged the following:

“Reverse ETL has long been a practical reality in data, viewed as an antipattern that we didn’t like to talk about or dignify with a name”

The Undercurrents

Each of the above chapters concludes with how each of the six undercurrents plays into the data engineering lifecycle stage in question. I found this device a really nice way to pull together ideas in the book in a consistent manner. I’d also urge any data architect to apply the six undercurrents to any platforms they are planning, as they feel like a robust way to make sure design choices are being well considered.
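
As a rough illustration of what that could look like in practice, here is a small Python sketch of my own (not something from the book) that treats the five lifecycle stages and six undercurrents as a checklist grid. The `unanswered` function and the shape of the `design_notes` dictionary are hypothetical; the idea is simply to flag every stage and undercurrent combination that a platform design has not yet addressed.

```python
# Lifecycle stages and undercurrents as listed earlier in this review.
STAGES = ["Generation", "Storage", "Ingestion", "Transformation", "Serving"]
UNDERCURRENTS = [
    "Security",
    "Data Management",
    "DataOps",
    "Data Architecture",
    "Orchestration",
    "Software Engineering",
]


def unanswered(design_notes):
    """Return the (stage, undercurrent) pairs that have no design note yet.

    design_notes is a hypothetical dict keyed by (stage, undercurrent),
    with a free-text note as the value.
    """
    return [
        (stage, undercurrent)
        for stage in STAGES
        for undercurrent in UNDERCURRENTS
        if not design_notes.get((stage, undercurrent))
    ]


# Example: only one cell of the grid has been thought about so far.
notes = {("Ingestion", "Security"): "Credentials held in a secrets manager"}
gaps = unanswered(notes)
```

With a single note recorded, `gaps` still holds the remaining 29 combinations, which is a fairly honest picture of how much there is to think about on any new platform.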
 

Part Three

The book concludes with a Part Three that visits two subjects: Security and The Future of Data Engineering. Security features throughout the book as one of the undercurrents, but it gets its own dedicated chapter here too, stressing the importance of this topic.
 

The ‘Future of…’ chapter was certainly interesting in terms of some of the predictions. The “decline of complexity and the rise of easy-to-use data tools” feels like something already in flight (dbt certainly seems to take us down this path), but interestingly, this direction of travel is one reason that this book is still really relevant for people like me, who have perhaps been alienated from the data engineering discipline due to its code-heavy nature and perceived complexity.

I also found the prediction around interoperability and the rise of metadata catalogues somewhat eerie, given the emergence in the space of the likes of Unity Catalog, Nessie and Polaris.

I’m perhaps less enamoured by the idea of the “live data stack” and the move towards real-time analytics. I kind of feel like that notion has already been somewhat quashed, and it certainly seems trendy for leading lights in data to recommend that batch is good enough, but we shall see. If the move to live data does pan out, then the rich amount of information about streaming contained within the book will certainly prove useful.

Overall Thoughts

The first thing to really call out is how well written and accessible this book is. I am definitely guilty of having various copies of other data “must reads” sat gathering dust on my shelf, abandoned half read due to their dryness. I surprised myself by rattling through the 400-ish pages of FODE in around a fortnight.

Secondly, the book gave me comfort that the fundamentals I’m familiar with are still true in the current era of data platforms. Joe Reis himself wrote recently that “Fundamentals are Gravity”. Ten years ago I was using SQL Server on-premise and SSIS to build data warehouses. That technology seems pretty dated now, but a lot of the things thought about back then still apply now. The more the world changes, the more it stays the same, right?

That’s not to say that I know it all, and I feel like reading FODE has only contributed to me becoming further rounded as a data professional. The hype associated with this book seems justified to me, and I’d encourage anyone working in the data space, not just data engineers, to own a copy.

It is of course available for purchase via Amazon - https://amzn.to/41E6Q4k - yep, that is an affiliate link, and if you decide to buy a copy of the book based on this recommendation, I’d really appreciate your support.