Welcome to my brain. This is a peek into the way it ticks, and into the fact that this particularly random issue popped into my head at 3am and then proceeded to keep me awake for the rest of the night. The issue in question: what’s the best way to deal with unknown dates in Power BI?

The scenario

What do I even mean? The scenario I’m talking about is where your fact table has more than one date associated with it. A classic example might be an online retail organisation, where the sales table shows the date an order was placed and the date the order was delivered.

Ordinarily the event that creates this record will be a customer placing an order, and I’d expect the order date to be populated in my fact table at the point of sale. The delivery, however, hasn’t happened yet, and as such the delivery date is unknown. So how should we be representing that unknown date?

Back to fundamentals

The first port of call for me is to go back to fundamentals and ask one of my favourite questions, “What would Kimball do?”

In Data Warehouses we’ve always had the concept of unknown or default members in dimension tables. Moving away from dates for a moment: if we made a sale of a new product, but for whatever reason the product info hadn’t yet been loaded into our operational product master data, then when it comes to loading our Data Warehouse we may find that the sale exists in the fact table, but the product itself doesn’t yet exist in the product dimension table. To deal with this scenario, an “unknown member” is created in our product dimension, often with a surrogate key of -1. When the fact table is loaded, we look up our surrogate keys from the product dimension table and any that aren’t found default to the unknown member value of -1.

This method ensures that we retain referential integrity in the Data Warehouse and that we can write efficient SQL queries using inner joins.

When the product data is updated with this new product, we can go back and update the foreign key in the fact table to point at the correct product.

Dealing with dates

So how would we deal with that situation when it comes to an unknown date? Well, much the same. In my Data Warehousing days we would often use a smart surrogate key for dates, meaning we’d express the date as an integer to use for the primary-to-foreign-key relationships. Today, for example, would be 20230304.

Where a date was unknown, we’d replace it with a default date for unknown members. Often this would be 19000101, though rather than going backwards in time I’ve also seen the default date set to a point in the future (21000101 is another example I’ve come across). In fact, in the past I have combined the two: 19000101 for genuinely unknown dates where the event has happened, versus 21000101 for dates that are unknown because the event hasn’t happened yet but I know it will occur in the future. Corresponding records for these unknown dates would still exist in our date dimension.

For our online retail example, if an order was placed today I’d likely set the order date key as 20230304 and the delivered date key as 21000101.

The Power BI approach

So is this approach suitable for Power BI too? The idea of referential integrity (RI) for Power BI is still very much valid. These articles by Mihaly Kavasi and Phil Seamark both discuss the importance of RI.

So the approach we’d take for a Kimball style Data Warehouse is the right one for Power BI, right?

Wrong.

Why Power BI is different

When it comes to date dimensions in Power BI, there are some special characteristics required of the date table to facilitate time intelligence in DAX. For instance, the date table should always start on 1st January and always end on 31st December. There is also the requirement for the date table to be contiguous, meaning no gaps between days in the calendar.
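To make that concrete, here’s a minimal sketch of a calculated date table in DAX (the year boundaries are purely illustrative, not from any real model). CALENDAR returns a contiguous, single-column table containing every date between the two boundaries you give it, so choosing a 1st January start and a 31st December end satisfies all three rules:

    Date =
    CALENDAR (
        DATE ( 2020, 1, 1 ),     -- illustrative boundary: always start on 1st January
        DATE ( 2025, 12, 31 )    -- illustrative boundary: always end on 31st December
    )

(CALENDARAUTO achieves the same shape automatically, returning contiguous whole calendar years based on the dates it finds in the model.)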

So if we go with the example outlined above, using 1900-01-01 for unknown dates and 2100-01-01 for future unknown events, that would mean having a date table running from 1st January 1900 until 31st December 2100 – over 200 years’ worth of date records! (It’s also worth noting that in a Power BI scenario I store these as dates, not as smart-key integers, as per this SQLBI article.)

From a data modelling efficiency perspective, I’m not too concerned about the memory consumed by all those extra records – it’s still only c.75k rows (201 years × 365 days is roughly 73,400). But the user experience is shoddy! Imagine that every time a self-service user wants to use your model for ad-hoc analysis they have to pick from over 200 years of dates, when they’re probably only really interested in what’s happened in the recent past.

My original solution

My original approach to solving this problem was to shorten the gap between my unknown low date and unknown high date. It’s 2023… most clients I work with are only really interested in looking at 5 or 10 years of data at most. So I started to use 1st January 2000 as my low date and 31st December 2030 as my high date. Better, but honestly still a bit rubbish.

So what’s a better way? Having had my sleepless night, I jumped on the Guy In A Cube livestream and asked the question as part of their “Power BI Tips from the Pros” segment. It prompted some really interesting debate in the chat, and my knights in shining armour came in the shape of Marco Russo and Paul Turley… are there two more prestigious names in the Power BI space?

The way forward

After a bit of to-and-fro over Data Warehouse best practice versus Power BI best practice (yes, it turns out the two aren’t necessarily the same) we arrived at this: unknown dates in Power BI datasets should be represented as NULL.

The reasoning is twofold: it keeps our date table topical, only including a date range that is relevant and valid for the data we’re analysing, and it avoids the unintended side effects that dummy dates can have on time intelligence calculations.
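As a minimal sketch of what that looks like in practice (assuming a Sales table with an always-populated Order Date column; both names are illustrative), the date table can be driven purely by the dates that actually exist in the data, and blank delivery dates simply don’t stretch the range:

    Date =
    CALENDAR (
        -- Sales[Order Date] is an illustrative column name; the order date is always populated
        DATE ( YEAR ( MIN ( Sales[Order Date] ) ), 1, 1 ),
        DATE ( YEAR ( MAX ( Sales[Order Date] ) ), 12, 31 )
    )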

This does present us with a trade-off: by taking this approach we no longer have referential integrity in our data model. However, the advice from Marco (which I’m more than happy to follow) is that the price of those RI violations is cheaper than the DAX needed to deal with dummy dates, and we retain a better experience for our users. I also take comfort in the fact that, in every scenario I can think of, the RI violation will happen on an inactive relationship, with active relationships ordinarily being tied to the originating point of our event (the order date in our example).
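To illustrate (the measure and relationship names here are hypothetical, not from a real model): with the active relationship on Sales[Order Date] and an inactive one on Sales[Delivered Date], any analysis by delivery date has to opt in via USERELATIONSHIP, which is exactly where the blank dates – and therefore the RI violations – sit:

    Sales by Delivery Date =
    CALCULATE (
        [Sales Amount],                                             -- illustrative base measure
        USERELATIONSHIP ( Sales[Delivered Date], 'Date'[Date] )    -- activate the inactive relationship
    )

Rows with a blank delivery date simply fall under the blank member of the date table in that context, rather than polluting the calendar with dummy dates.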

Other approaches

The same conversation cropped up again on my Twitter feed recently, and resulted in some good discussion with Tom Martens. He takes a different approach, using the end of his calendar for unknown dates and marking those records as “projected” wherever he’s used the dummy date. This means he does retain RI, and he can use the projected flag to exclude the dummy-dated records from any measures. Tom’s opinion is another that I value highly when it comes to data modelling, so it’s clear there isn’t necessarily one right answer to this dilemma.
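A rough sketch of that pattern (the Is Projected column and Sales Amount measure are my own illustrative names, not Tom’s): the flag lets a measure filter the dummy-dated rows straight back out:

    Delivered Sales =
    CALCULATE (
        [Sales Amount],                     -- illustrative base measure
        Sales[Is Projected] = FALSE ()      -- exclude rows carrying the dummy delivery date
    )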

Personal preference

However, personally I’ve decided that going forward I’m going to adopt the NULL approach. It’s worth pointing out that if your model includes a time table as well, the same principle applies. And I do still value referential integrity in my data models: for all other dimensions (such as Product, Customer, Supplier etc.) I’ll continue to use unknown members, but from now on I’ll be treating Date and Time as the exceptions.

