A technical blog about my projects, challenges, and discoveries in the world of data warehousing using SQL Server, Power BI Desktop, DevExpress, and more.

Monday, March 7, 2016

Using an Operational DB Key as a "Fact" in Factless Fact Tables

As most of the readers of this blog know from other communication sources, I recently moved from Shelby Systems, Inc. to Custom Data Systems, Inc. One reason for my change is that I now have the opportunity to cut my teeth on a data warehouse project from start to finish, something that I have long wanted to do ever since learning about data warehousing at my first PASS Summit event back in 2010. At that event I bought the book The Data Warehouse Toolkit by Ralph Kimball and Margy Ross, and I have used it as a resource ever since. Indeed, it has been a constant source of help to me during these last couple of months as I have put the theory of data warehousing into practice at my new employer. On the topic of helpful books, I should also mention Star Schema: The Complete Reference by Christopher Adamson. I highly recommend both books for those just getting started with data warehouse. For those with some experience at star schema modeling, Adamson's book could still be quite helpful, as he presents a large number of practical examples of various challenges in star schema design as well as discussion of approaches quite different from the Kimball method.

For the pure design of the star schema data warehouse database, these resources never led me wrong. However, when it came to the point of implementing analysis on top of the database per se, I found that there was an omission that in retrospect I find quite surprising. It came to my attention as I tried to perform analysis of data from two separate dimensions linked via a so-called "factless" fact table using Power BI Desktop as the platform for the analysis. This problem may be unique to Power BI Desktop, but I doubt it. I believe it is a symptom of basing the output on the "tabular" model of analysis rather than on an analysis cube or on a single query.

The Problem

Simply put, the problem is this: unless a value from the fact table is included in the analysis in some way, the relationship between the dimensions via the intervening fact is ignored. I discovered the problem when I created a bar chart visualization based on the following three tables:

Individuals Dimension
Individuals Key (surrogate key)
Individual ID (operational db key)
First Name
Last Name
Member Status
Billing Classification
etc.

Dates Dimension
Dates Key (surrogate key)
Date
Calendar Year
Calendar Month
etc.

Daily Summary Fact (associates each individual with each date, for point-in-time analysis)
Individuals Key
Dates Key

After adding these three tables into Power BI and relating them together, I dropped [Dates].[Calendar Year] into the bar chart's "Axis" property, and I dropped [Individuals].[Individual ID] into the "Values" property and changed it to "Count Distinct" so that I could see how many members there were each year. I also added [Individuals].[Member Status] to the "Legend" property so that each bar would be subdivided by member type.

I was surprised that this yielded exactly the same numbers for the overall totals and for the shaded areas for every year in the output. That seemed odd even for my imaginary test data, so I started looking into the values in the operational database and in the data warehouse. After several ad hoc queries, I confirmed that the results in the graph were erroneous. And eventually I discovered that the numbers were the overall totals for all the data, not broken out by year. So the ultimate conclusion I came to was that the graph was ignoring the [Daily Summary] fact table that related individuals to dates and was, instead, showing all individuals for all dates.

Two Not-Quite Satisfactory Solutions


The standard approaches for doing counts on factless fact tables such as the [Daily Summary] table are:

  • Apply the COUNT() function to any of the surrogate key values in the fact table.
  • Add a "counter" column with a value of 1 in each row.
While these are fine for a basic count of values, neither works for COUNT DISTINCT. Obviously a "counter" column would return a COUNT DISTINCT value of 1, since every row has the exact same value. And the surrogate key of an individual is not a reliable value for COUNT DISTINCT because the same individual can have multiple surrogate keys, assuming the dimension captures changes over time (the so-called "slowly changing dimension"), which in the case of the [Individuals] dimension it certainly is.

All the examples from the Kimball book and the Adamson book are based on the assumption that the aggregations will be handled in a SQL query with an INNER JOIN between the dimension and "factless" fact tables. And in SQL of course the relationships are honored even if the SELECT clause does not have any columns from the fact table per se. But in Power BI and (probably) other analysis tools using the "tabular model" instead of single sets of query results, this is evidently not the case.

My Solution

The solution that worked for me is to include the operational database identity key directly into the fact table. This is never suggested by Kimball or Adamson (that I found in print anyway), but it resolves the problem by putting the one "dimensional" value that I would want to aggregate with COUNT DISTINCT into the fact table as a column that can be dropped into the Values property of any Power BI visualization. My revised [Daily Summary] fact now looks like this:

Daily Summary Fact
Individuals Key
Dates Key
Individual ID

Using [Daily Summary].[Individual ID] instead of [Individuals].[Individual ID] in my Power BI graph yields the expected and accurate results.

Thinking beyond the specific problem of "factless" fact tables, I can anticipate wanting to use COUNT DISTINCT on the [Individual ID] value for a variety of analysis contexts, even when the fact table might in fact contain more traditional facts such as monetary values or other additive quantities. So my current plan is to incorporate the operational key value in every relevant fact table, whether "factless" or not.

Conclusion

Even though literature on data warehousing with the star schema places the identity key of the operational database only in dimensions, I have discovered that it is useful and in some cases necessary to put it in fact tables too. From now on, that value goes in both places in my star schema designs. If you use Power BI Desktop, Excel Power Pivot, or any other self-service BI tool based on the tabular model, you might want to do the same and save yourself some headaches.

No comments:

Post a Comment

Followers