Book review: Fundamentals of Data Engineering
If you’re immersed in the data world, then over the last few years it’s been impossible to escape the book ‘Fundamentals of Data Engineering’ by Joe Reis and Matt Housley. And even if you haven’t come across the book, its writers have become hard to avoid via various other media.
Although Matt Housley is perhaps a little less conspicuous, Joe Reis seems to be everywhere, be it podcasts, conferences or blogs. I subscribe to both Joe’s Substack and the Practical Data Modelling Substack he also runs, and often find myself nodding along sagely to his writings - he speaks a lot of sense!
But I'm no Data Engineer!
I have previously talked about the fact that I don’t see myself as a Data Engineer. Historically I was a jack-of-all-trades BI developer who looked after the full data lifecycle, but as data disciplines have matured and niched down to more specific responsibilities, I’ve found myself more in the analytics space. My strengths lie in collaborating with stakeholders to understand their needs, modelling data to help answer their questions, and serving it via a means they can interact with (that could be a semantic layer or dashboards and reports).
Building platforms and pipelines is not really my bag, especially as the world has moved in a more code-heavy direction.
But I’m also a completist. And I really believe that a good data professional should have a more holistic view of the world, understanding, at least in a rudimentary fashion, the upstream and downstream dependencies of whichever specialism they choose.
The book
The main premise of the book is the idea of the data engineering lifecycle. This lifecycle has five main stages: Generation, Storage, Ingestion, Transformation and Serving. Running through each of these stages are six common undercurrents:
- Security
- Data Management
- Data Ops
- Data Architecture
- Orchestration
- Software Engineering
If I speak to most data engineers today, I think they would recognise the Ingestion and Transformation parts of this lifecycle as their bread and butter. But thinking about data from a cradle-to-grave perspective and inspecting the full breadth of its use prompts you to go deeper than this, and gives a more complete picture of the engineering discipline. Part one of the book gives an overview of data engineering building blocks and sets the context for part two, which contains a chapter for each of the data engineering lifecycle stages.
Generation
Storage
Ingestion
Transformation
This section also covers one of my favourite topics, data modelling. You need to have a plan for what shape to transform that data into, right? Popular approaches such as Inmon, Kimball, Data Vault and OBT all get a mention here, with a pragmatic tone that does not necessarily espouse one over the other.
Serving
Ah, my baby… This is probably the area I feel most at home in. Again, although data engineers might not always be responsible for this stage of the lifecycle, they are very much responsible for providing the inputs to it, and so context about this is really important for a good engineer.
I especially enjoyed the line about making data products being a full-contact sport. Here serving is broken into three main categories: Analytics, Machine Learning and Reverse ETL. Analytics gets a further breakdown, differentiating between business analytics and operational analytics, including a soundbite I’m sure to recycle many times:
Operational analytics versus business analytics =
immediate action versus actionable insights
I also enjoyed the fact there is a call out that “reverse ETL” is a godawful name for the practice, whilst earlier in the book they had also acknowledged the following:
“Reverse ETL has long been a practical reality in data, viewed as an antipattern that we didn’t like to talk about or dignify with a name”
The Undercurrents
Part Three
The ‘Future of…’ chapter was certainly interesting in terms of some of the predictions. The “decline of complexity and the rise of easy-to-use data tools” feels like something already in flight (dbt certainly seems to take us down this path). Interestingly, this direction of travel is one reason the book is still really relevant for people like me, who have perhaps been alienated from the data engineering discipline by its code-heavy nature and perceived complexity.
I also found the prediction around interoperability and the rise of metadata catalogues somewhat eerie, given the emergence in the space of the likes of Unity Catalog, Nessie and Polaris.
I’m perhaps less enamoured with the idea of the “live data stack” and the move towards real-time analytics. I kind of feel like that notion has already been somewhat quashed, and it certainly seems trendy for leading lights in data to recommend that batch is good enough, but we shall see. If the move to live data does pan out, then the wealth of information about streaming contained within the book will certainly prove useful.
Overall Thoughts
Secondly, the book gave me comfort that the fundamentals I’m familiar with are still true in the current era of data platforms. Joe Reis himself wrote recently that “Fundamentals are Gravity”. Ten years ago I was using SQL Server on-premises and SSIS to build data warehouses. That technology seems pretty dated now, but a lot of the things we thought about back then still apply now. The more the world changes, the more it stays the same, right?
That’s not to say that I know it all, and I feel like reading FODE has made me a more rounded data professional. The hype associated with this book seems justified to me, and I’d encourage anyone working in the data space, not just data engineers, to own a copy.
It is of course available for purchase via Amazon - https://amzn.to/41E6Q4k - yep, that is an affiliate link, and if you decide to buy a copy of the book based on this recommendation, I’d really appreciate your support.