SELECT Statements

A technical blog about my projects, challenges, and discoveries in the world of data warehousing using SQL Server, Power BI Desktop, DevExpress, and more.

Thursday, January 16, 2020

Basic T-SQL Cursor syntax

I'm always looking up the basic syntax to create a cursor. I use cursors rarely, as is proper in SQL Server, so I have to remind myself of the details each time. The examples in Microsoft's online documentation are more elaborate than I need, and I have to mentally strip away pieces to get to the core elements. Instead, I'm writing this blog post to cut to the chase. The varchar(100) specification is just a placeholder for whatever data type you need to store the value from the cursor query.

Here is the basic structure of a cursor, with comments to explain each section:

/* Create one or more variables to capture the values returned by the cursor query. */
declare @my_cursor_value as varchar(100)

/* Declare the cursor with the SELECT statement. */
/* Use local to isolate the cursor so there is no chance of conflict with another instance of a cursor with the same name. */
/* Use fast_forward to optimize the cursor when you don't need to move backward in it. */
declare [my_cursor] cursor local fast_forward for
select [my_column]
from [my_table]

/* Open the cursor, which prepares the query results. */
open [my_cursor]

/* Get the first row of values from the cursor and put the results into the declared variable(s). */
fetch next from [my_cursor] into @my_cursor_value

/* Start a loop to use the already-fetched values, then pull the next set of cursor values. */
while @@Fetch_Status = 0

/* Enclose the steps you want to take using BEGIN...END */
begin
 print @my_cursor_value

 /* Always include FETCH NEXT to avoid an endless loop. */
 fetch next from [my_cursor] into @my_cursor_value
end

/* Close the cursor and release the current result set. */
close [my_cursor]

/* Release the cursor definition from memory. */
deallocate [my_cursor]


Here is the code stripped of comments, for a quick copy-and-paste of the essential structure:

declare @my_cursor_value as varchar(100)

declare [my_cursor] cursor local fast_forward for
select [my_column]
from [my_table]

open [my_cursor]

fetch next from [my_cursor] into @my_cursor_value

while @@Fetch_Status = 0

begin
 print @my_cursor_value

 fetch next from [my_cursor] into @my_cursor_value
end

close [my_cursor]

deallocate [my_cursor]


Friday, October 12, 2018

Checking for the Existence of Database Objects for Idempotent Code

I am constantly having to search for the best way to determine if a certain kind of database object exists because there is no one way that works for all of the various kinds of objects. So I'm posting here all the various methods in one place, so I can easily find whichever code I need at the time. If it helps you too, all the better.

Usually I'm looking for this kind of code when I need to make a SQL script that is repeatable any number of times without erroring out because something was already handled in a previous execution of the script. In technical parlance this is called "idempotence," the quality of producing the same result no matter how many times the script runs. The examples below pair each existence check with a bit of abbreviated code to show the idempotent pattern for each type of database object.

Create a Table

if not exists(select * from Information_Schema.Tables where Table_Schema = 'dbo' and Table_name = 'My_Table')
create table [dbo].[My_Table] (MyColumn int);

Update or Create a Stored Procedure:

if object_id('my_stored_procedure', 'P') is not null drop proc [dbo].[my_stored_procedure];
go
create procedure [dbo].[my_stored_procedure]
...

Add a Table Index

if not exists (select * from [sys].[indexes] as [i] inner join [sys].[tables] as [t] on [i].[object_id] = [t].[object_id] where [i].[name] = 'my_index' and [t].[name] = 'my_table')
create clustered index [my_index] on [my_table] ([my_column])

Replace clustered with whatever index type qualifiers are appropriate for the index you are creating.

Add a Column 

if col_length('[my_schema].[my_table]', 'my_column') is null
alter table [my_schema].[my_table] add [my_column] int not null

Replace int not null with whatever data type and qualifier you need for the new column.

Update or Create a Trigger

if object_id(N'[my_trigger]', 'TR') is not null
drop trigger [dbo].[my_trigger];
go

create trigger [dbo].[my_trigger] on [dbo].[my_table] 
for update
as
...

Add a Constraint

if object_id('my_constraint') is null alter table [my_table] add constraint my_constraint unique ([my_column]);

Add or Modify a User Defined Function

if object_id('my_function', 'FN') is not null drop function [dbo].[my_function];
go

create function [dbo].[my_function] ()
returns datatype
as
begin
 /* my code, ending with a RETURN of the declared data type */
end

Wednesday, December 6, 2017

Displaying a button in a DevExpress DataGrid cell

This post is a perfect example of the primary purpose of this blog: to document obscure and difficult-to-remember steps for accomplishing tasks in SQL Server or the other platforms I use for my work. Since coming to Custom Data Systems, Inc. I have become a VB.Net developer, and in that capacity I have used a third-party development toolkit called DevExpress. In general, this toolkit makes it easy to do complex development tasks in Windows forms and in self-service business intelligence from a VB.Net context. But there is one task I have had to do a couple of times, and each time I have wasted at least a couple of hours figuring out how to do it: adding a button to a data grid column. So to make this easier on myself the next time I have to do this, here are the necessary changes to a column definition to make it into a button:


  1. Run the Grid Designer tool for the data grid.
  2. Make sure the selected view is the one that should have a button in a column.
  3. Switch to the Columns page.
  4. Click on the column you want to change into a button.
  5. Click the drop-down option for the ColumnEdit property and choose New > ButtonEdit.
  6. Drill down under the ColumnEdit property family under Buttons > 0 - Ellipsis.
  7. Find the Image property and click the [...] button that appears when you give focus to the property.
  8. Switch to the DX Image Gallery tab and find an icon suitable for the context. This is what the button will look like. Click the icon and then click OK.
  9. Find the Kind property, which is set to Ellipsis by default, and change it to Glyph.
  10. Scroll down to the bottom of the ColumnEdit family of properties and find the TextEditStyle and change it to DisableTextEditor.
  11. Scroll down further in the main property list of the column. Find ShowButtonMode, and change it to ShowAlways.
That should do it. Seems simple, eh? But I forget every time to change the button "kind" to glyph, and without that one change, the button never shows up with the selected image. So hopefully the next time I have to put a button into a grid, I'll remember that I wrote this blog and will come here to remember all the steps. And if you work with DevExpress too, maybe it will help you. :-)

P.S. The column with the button must be editable (AllowEdit = True) in order for the user to be able to click on the button and trigger the repository item's "Click" event.  In a recent project I had turned the grid's master Editable property to False and was using the MouseDown event of the grid control instead of the  Click event of the repository item. But this caused the user to have to click twice on a tab control to move from one tab to the other whenever they had previously clicked on a button. So avoid MouseDown and use the Click event, making sure both the grid control as a whole and the individual column are both set to be editable.

Tuesday, December 5, 2017

Using Two List Boxes to Manage Key Word Tags (or Similar Values)

Recently I started using OutSystems as a web development tool because Microsoft decided to abandon Access Web Apps and replace it with a not-ready-for-prime-time tool called Power Apps. I tried to use Power Apps, but when I found it incapable of using SQL Server views of data to present information to the user, I decided it would not serve my purposes and found OutSystems.

I'm glad I did. OutSystems provides all the functionality I need plus room for growth down the road. It opens up new possibilities rather than restricting my options unreasonably as Power Apps would have done. It is called a "low code" environment as opposed to the "no code" environment Power Apps touts. I guess "low code" is what I need, because even though I have no prior web development experience, I was able to follow their tutorial project without any difficulty, and after having done that, I was able to reproduce all the functionality of the Access Web App I had already written for the Tobacco Farm Life Museum (run by my sister Melody) and then improve upon it.

All of that is preamble to this blog post, which is mainly about one of the ways I was able to improve upon the Access Web App design using the OutSystems tools. In the database, people and firms were associated with "interests" to enable key word searches based on the interests. The table that relates people (or firms) with interests is a bridge table with a many-to-many relationship: one interest can be associated with many people, and one person can be associated with many interests. Originally, the only way to edit the interests of either a firm or a person was to open that one firm or person's record, go to the list of interests for that entity, and then edit the interests list. This meant that adding the same interest to a lot of people or firms was time-consuming. But then I remembered a user tool in the Shelby Systems software that I used to work with that allowed users to assign multiple people to a particular option using two list boxes side-by-side, one list of "available" names and one list of "selected" names. I wondered if I could achieve that same effect using OutSystems, and it turned out that I could. Here is a screenshot of the kind of interface I was able to create.



There are several steps to the process of creating such an interface, and the rest of this blog post is here to document those steps. If you use OutSystems, you can download a sample project from the Forge library that I created to show how to implement this solution. If you don't use OutSystems but want to produce the same effect, hopefully the steps here will help you with the outline of the process, even if you have to implement it differently.

The names in the steps below are generic to represent three database tables (or entities in OutSystems parlance): a MainEntity (for the entity that will be associated with one or more categorical tags), a TagEntity (for the simple categorical tag entity), and a BridgeEntity (for the table that links the main entities with their tags).
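
For readers who think in SQL (as I do), here is a rough T-SQL sketch of those three generic tables. The column names come from the steps below; the data types and constraints are just placeholders I chose for illustration:

create table [MainEntity] (
 [MainEntityId] int identity primary key,
 [FullName] varchar(100) );

create table [TagEntity] (
 [TagEntityId] int identity primary key,
 [TagDescription] varchar(100) );

/* The bridge table: one row for each pairing of a main entity with a tag (many-to-many). */
create table [BridgeEntity] (
 [BridgeEntityId] int identity primary key,
 [MainEntityId] int not null references [MainEntity] ([MainEntityId]),
 [TagEntityId] int not null references [TagEntity] ([TagEntityId]) );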

  1. Create a session variable to hold the entity identifier for the tag entity: TagEntitySearchValue.
  2. Add a blank web screen page to the project and name it appropriately: ManageTagAssignments.
  3. Add a local variable to the page that represents the main entities that are not yet associated with a selected tag entity. It should be a "List of Record" data type that includes the main entity table itself and an additional Boolean value to indicate if the user has selected the entity in the list box: MainEntitiesWithoutTags.
  4. Add another local variable to the page that represents the main entities that are associated with the selected tag. It should also be a "List of Record" data type with the bridge entity plus a Boolean value to indicate if the user has selected the entity in the list box: MainEntitiesWithTags.
  5. Add a third local variable to the page that represents one association of a main entity with a tag entity. It should be defined using the bridge entity: CurrentBridgeEntity.
  6. Add a preparation to the web page. It will have four steps between the Start and End points.
    • Add an aggregate to get the list of tag entities. You can make this simply by dragging the tag entity from the Data tab onto the preparation sequence.
    • Add a second aggregate to get the main entities not associated with the selected tag entity: GetMainEntitiesWithoutTags.
      • Include both the main entity and the bridge entity. There are a few additional changes that you need to make to this aggregate.
      • Edit the join condition and add a requirement that the bridge table tag entity identifier value must equal the session variable you set up in Step 1: BridgeEntity.TagEntityId = Session.TagEntitySearchValue.
      • Add a filter so that the aggregate only includes main entities without a matching tag entity value: BridgeEntity.TagEntityId = NullIdentifier(). (In SQL terms this aggregate amounts to a left join with an IS NULL check; see the sketch after this list.)
      • Add a filter so that the list is empty unless there has been a value assigned to the session variable you set up in Step 1 above: Session.TagEntitySearchValue <> NullIdentifier().
      • Add a new column with a Boolean value of False so that the results of the aggregate will match all the required values of the local variable: IsSelected.
    • Add a third aggregate to get the main entities that are associated with the selected tag entity: GetMainEntitiesWithTags.
      • Include both the main entity and the bridge entity.
      • Add a filter to limit the list to main entities with the selected tag entity: BridgeEntity.TagEntityId = Session.TagEntitySearchValue.
      • Add a filter so that the list is empty unless there has been a value assigned to the session variable: Session.TagEntitySearchValue <> NullIdentifier().
      • Add a new column with a Boolean value of False to represent selected values: IsSelected.
    • Add an assign action to populate two of the local variables with the related aggregate lists:
      • MainEntitiesWithTags = GetMainEntitiesWithTags.List
      • MainEntitiesWithoutTags = GetMainEntitiesWithoutTags.List
  7. Add a title to the new web page.
  8. Add a container to the main body of the web page.
  9. Put a combo box inside the container to list the available "tag" values. Give it an appropriate name. Set the other properties as follows:
    • Variable -- the session variable for the tag entity value.
    • Mandatory - true
    • Source Record List - the prepared aggregate for the list of tag entities.
    • Source Attribute - the attribute from the prepared aggregate that is the plain text value for the tag entity.
    • Special List Value 1 - None
    • Special List Option 1 - No Interest Selected.
  10. Add a button widget to the right of the combo box. Change the label to an appropriate action for making the selected item the active tag. I labeled mine as "Search." Make the destination a new screen action.
  11. Add a refresh data action to the sequence and choose the aggregate in the preparation that contains the unselected entities.
  12. Add a second refresh data action to the sequence and choose the aggregate that contains the selected entities.
  13. Return to the main edit screen for the web page interface.
  14. Add a second container to the main body of the page, underneath the first.
  15. Add four containers inside the second container. Change the width of each to be 6 columns wide. Also change the left margin of all the containers to 0. This will create a two-by-two block of containers. 
  16. Inside the first nested container, add a label with a value describing main entities not associated with the tag. Set the width of the label to the same width as the container, 6 columns. Set the style class to Heading 2.
  17. In the container underneath the label you just made, add an expression. The expression will complete the header begun in the label above it with a lookup of the plain text for the selected tag entity value or a generic description when there is no tag selected: If(Session.TagEntitySearchValue = NullIdentifier(), "Tag", GetTagEntity(Session.TagEntitySearchValue).TagEntity.TagDescription).
  18. Set the style class for the expression you just created to Heading 2.
  19. In the container to the right of the first label, add another label with a value describing the entities associated with the tag.
  20. In the container underneath the second label, add another expression exactly the same as the one in Step 17.
  21. Add another container to the page, underneath the others at the same level as the container you created in Step 14. (The exact hierarchical positions of all these containers are not critical as long as the layout works.)
  22. Add three more containers inside the one you just made, and change their column widths to 5, 1, and 6 respectively. Change the left margin of the far right container to 0 so that it is aligned exactly below the labels above it.
  23. Add a list box to the container on the far left and give it a name designating it for the main entities not associated with the selected tag value: UnassociatedListBox. Assign the properties as listed below:
    • SourceRecordList: Use the local variable for the list of main entities not associated with the selected tag value: MainEntitiesWithoutTags.
    • SourceAttribute: Use the value from the local variable for the plain text value to display: MainEntitiesWithoutTags.MainEntities.FullName.
    • SelectionAttribute: Use the attribute from the local variable for the selection state: MainEntitiesWithoutTags.IsSelected. Even though the value for this attribute is set as "False" initially, the list box will overwrite that value to reflect the current user selections.
    • Width: Match the width of the container. 
    • Height: Choose a reasonably long height to balance scrolling within the list box and scrolling the browser pane.
  24. Add a list box to the container on the far right and give it a name designating it for those entities associated with the selected tag value: AssociatedListBox. Assign the properties as listed for Step 23, except for the SourceRecordList use the local variable for the selected entities: MainEntitiesWithTags.
  25. In the container between the two list boxes, add an icon widget. Give it a name signifying the process to add a new entity: AddMainEntities. Change the Source Web Block / Name property to Entities.IconName.angle_double_right and the Size to Entities.IconSize.Size_3x.
  26. Link the icon widget to a new screen action.
  27. Open the new action to edit it. Rename it appropriately: AddMainEntities.
  28. Drag a "for each" action to the sequence and assign its record list to the list box for entities not associated with the selected tag: UnassociatedListBox.List.
  29. Drag an "if" action beside the sequence near the "for each" action and set its condition property to the IsSelected value of the current item in the list: UnaffiliatedListBox.List.Current.IsSelected. Since this is a Boolean value, it can serve as the condition expression all by itself.
  30. Drag a connection from the "for each" action to the "if" action to create the first step in the "for each" cycle.
  31. Drag an "assign" action beside the sequence just below the "if" condition. Make the following assignments:
    • The main entity id for the local variable representing the "bridge" table should be set to the value of the current list box item for entities not-yet-associated with the tag:  CurrentBridgeEntity.MainEntityId assigned to UnassociatedListBox.List.Current.MainEntities.MainEntityId.
    • The tag entity id for the local variable representing bridge entity should be set to the value of the currently selected tag item: CurrentBridgeEntity.TagEntityId assigned to  Session.TagEntitySearchValue.
    • The bridge entity identifier for the local variable should be set to a null value so that a new entry will be created for the relationship in the next step: CurrentBridgeEntity.BridgeEntityId assigned to NullIdentifier().
  32. Drag a connection from the "if" condition to the "assign" condition to define the first action if the condition is true.
  33. Drag a connection from the "if" condition back to the "for each" condition to close the loop when the condition is false and trigger the next iteration of the cycle.
  34. Drag a "run server action" action beside the sequence below the "assign" action. Select the "create" action under the bridge entity: CreateBridgeEntity. As the source for this action, choose the local variable for the current bridge entity: CurrentBridgeEntity.
  35. Drag a connection from the "assign" action to the "run server action" action to define the next step in the cycle when the "if" condition is true. Then drag a connection from the "run server action" action to the "for each" action to close the loop when the condition is true and trigger the next iteration of the cycle.
  36. Add a "refresh data" action to the sequence after the "for each" action and choose the aggregate that contains the list of main entities not associated with the selected tag. This will refresh the related list box and the local variable tied to it.
  37. Add a second "refresh data" action to the sequence and choose the aggregate that contains the list of main entities associated with the selected tag. This will refresh the related list box and the local variable tied to it.
  38. Return to the main edit screen for the web page.
  39. Add another icon widget underneath the one to add entities to the tag. Set this one's name appropriately and then set the Source Web Block Name to Entities.IconName.angle_double_left and its size to Entities.IconSize.Size_3x.
  40. Link the new icon widget to a new screen action.
  41. Open the screen action and name it appropriately: RemoveMainEntities.
  42. Add a "for each" action to the sequence and set the record list to the list of the box containing the names already associated with the tag: AssociatedListBox.List.
  43. Drag an "if" action beside the sequence near the "for each" action and set its condition to the IsSelected value of the current item in the list of entities associated with the tag: AssociatedListBox.List.Current.IsSelected.
  44. Drag a connection between the "for each" action and the "if" action to define the first step in the cycle.
  45. Drag a "run server action" action beside the sequence underneath the "if" action. Set it to the server action to delete the relationship between the main entity and the tag from the bridge table: DeleteBridgeEntity. Set the Id property of the action to the entity Id value for the current row of the list of entities associated with the tag: AssociatedListBox.List.Current.BridgeEntity.BridgeEntityId.
  46. Drag a connection between the "if" action and the "run server action" action to define the next step of the loop if the condition is true. 
  47. Drag a connection between the "if" action and the "for each" action to close the loop when the condition is false and move to the next iteration of the cycle.
  48. Drag a connection between the "run server action" action and the "for each" action to close the loop when the condition is true and move to the next iteration of the cycle.
  49. Drag a "refresh data" action to the sequence after the "for each" action and choose the aggregate that contains the list of main entities not associated with the selected tag.
  50. Add a second "refresh data" action to the sequence after the "for each" action and choose the aggregate that contains the list of main entities already associated with the selected tag.
  51. Add a link from a web page in your main flow to the new web page.
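
As mentioned in Step 6, here is roughly what those two aggregates would look like as T-SQL queries against the generic tables sketched earlier. In OutSystems the joins and filters are defined visually in the aggregate editor, so this is only for orientation:

declare @TagEntitySearchValue int = 1; /* stands in for Session.TagEntitySearchValue */

/* GetMainEntitiesWithoutTags: main entities not yet associated with the selected tag. */
select m.[MainEntityId], m.[FullName]
from [MainEntity] as m
left join [BridgeEntity] as b
 on b.[MainEntityId] = m.[MainEntityId]
 and b.[TagEntityId] = @TagEntitySearchValue
where b.[TagEntityId] is null;

/* GetMainEntitiesWithTags: main entities already associated with the selected tag. */
select m.[MainEntityId], m.[FullName], b.[BridgeEntityId]
from [MainEntity] as m
inner join [BridgeEntity] as b on b.[MainEntityId] = m.[MainEntityId]
where b.[TagEntityId] = @TagEntitySearchValue;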

Wednesday, March 30, 2016

Geocoding on the Cheap (i.e. Free)

It has been a long time since I've been so excited about sharing a database technique on my blog as I am about this one. Geocoding enables a wide range of geographical reporting that is almost always the most whiz-bang part of any demonstrations of reporting output. But the data behind the flashy maps has been hard to come by, and it is usually considered too pricey for most small-to-medium sized users of databases because in most situations geocoding individual addresses requires a fee-based license agreement with Google, Bing, or some other source of geocoding information.

But I recently learned that if you are willing to compromise the precision of the data just a little and use approximate locations of your address information instead of the exact locations, you can do geocoding at no cost at all. That's right: free. The "trick" is that you have to be okay with using the geographic center of the ZIP code area as the location of each address in that ZIP code. According to http://www.mapszipcode.com the average ZIP code area is 90 square miles, with some much smaller than that and a few larger. If my math is correct, that means that the largest difference between the actual address and the center of the ZIP code (on average) is about 6.7 miles: treating a 90-square-mile ZIP code as a square, each side is roughly 9.5 miles, so a corner sits about 9.5 × 1.414 / 2 ≈ 6.7 miles from the center. Most differences will be much smaller than that.

If that margin of error is within your tolerance for geographical reporting, read on.

Step One: Download a Free Data Source of ZIP Code Latitude and Longitude Information


Learning that there are free data sources of ZIP code lat/long information was the "aha" moment that opened up in my mind all the possibilities of doing geocoding reporting without expensive licensing agreements. I did some searching to find the best source and found the following links, which I provide for your review. There are good tips on these, and I won't waste space repeating what they say.

http://www.uszipcodes.com/zip-code-latitude-longitude-quality.htm
http://greatdata.com/resources
https://boutell.com/zipcodes/

The last one listed is the site from which I chose to download the data. It is in a simple comma-delimited file format, easy to work with.

Step Two: Prepare the Data Source for Import into SQL Server


What you have to do in this step will be determined by which data source file you download in the previous step and the method of importing the data you plan to use. The data from boutell.com is quote-and-comma delimited, which is not a format easily handled by the bulk import utility I used for the import. Also, it contains two time-zone columns that are not relevant to the geocoding process. So I opened up the source file in Excel, deleted the two columns of time-zone data, and then saved the file with a new name and in a basic CSV file format.

Be aware that there are a handful of cities that use an apostrophe in the name of the city. The bulk insert process I used handles that fine, but if you use a different method, you may need to locate those cities and either remove them or deal with them in a special way.

Step Three: Prepare the Database for the Import


This might be the trickiest step for some. To use the same bulk import utility that I did, you will need to create a brand new table to receive the data. If you use a different method for the import, you might not need to do that. If you make a table, it might be a temporary table (if you have another, more permanent table where the geocoding data will ultimately land) or it might be a permanent addition to your database. How you name and use the table will depend on whether it is temporary or permanent. In my production environment, this was just a temporary table.  Here is the script I used to make the table:

create table zcta (
 zip char(5) primary key,
 city varchar(64),
 state char(2),
 latitude float,
 longitude float );


Step Four: Import the Data


I chose to use a bulk insert routine executed from SQL Server Management Studio. Here is a blog post from another SQL Server user showing how simple the process is:

http://blog.sqlauthority.com/2012/06/20/sql-server-importing-csv-file-into-database-sql-in-sixty-seconds-018-video/.

Because of the data clean-up in Step Two, my process was pretty much identical to the one shown in the video, except for the file location and file name.
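
For reference, the statement is roughly this. The file path is just a placeholder for wherever you saved the cleaned-up CSV, and firstrow = 2 assumes the header row survived the clean-up (leave that option off if your file has no header row):

bulk insert zcta
from 'C:\Temp\zipcodes.csv'
with (
 fieldterminator = ',',
 rowterminator = '\n',
 firstrow = 2 );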

Now that you have every meaningful ZIP code in your database along with their latitude and longitude values, you can connect that lat/long information with the addresses in your database that have a ZIP code. Getting to this point is the major hurdle to geocoding. Once you get to this point, the rest is easy.

Step Five: Add a Geography Column


The lat/long information is essential, but it is not in a format that SQL Server can readily use for calculating distances. There is a special data type called "geography" that is specifically designed for this purpose. There is a lot you could learn about the geography data type, but here's all you need to know:

  • The "geography" data type was introduced in SQL Server 2008.
  • There are sub-types for "geography" values, but the only one you need is POINT.
  • A geography "point" is defined by longitude and latitude (in that order).
  • Because the geography data type is implemented as a SQL CLR type, the functions and methods related to it are case-sensitive.
  • The easiest syntax for populating a geography point is this: geography::Parse('POINT (long lat)')
  • The geography data type can be declared for table columns, view columns, and SQL variables.
Note that in the syntax example, long and lat should be replaced by string versions of the longitude and latitude values, respectively.
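
To illustrate the last bullet, here is a tiny snippet you can run on its own; the coordinates are just an arbitrary example point:

declare @point geography = geography::Parse('POINT(-90.0490 35.1495)');

/* Read the longitude and latitude back out of the geography value. */
select @point.Long as [Longitude], @point.Lat as [Latitude];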

For my purposes, I wanted the geography data to be calculated for each ZIP code just once, so I added a computed column onto the permanent version of the ZIP code lookup table. Here is the code I used:

alter table [dbo].[ZIP Code Lookup] add [GeoLocation] as geography::Parse('POINT(' + convert(varchar(20), [Longitude]) + ' ' + convert(varchar(20), [Latitude]) + ')') persisted

The longitude and latitude values are stored in the float data type in the table, so they have to be converted to varchar in order to be concatenated into the geography function. The "persisted" key word at the end means that the value is calculated just once and then stored semi-permanently. It will only be re-calculated if either the longitude or latitude value changes in the lookup table.

Step Six: Use the Geography Column to Calculate Distance


Here is the payoff of all the work above. With the geography column representing the lat/long position of the address ZIP codes, you can calculate approximate distances between any two addresses with the STDistance function, just like this:

[Geography Point 1].STDistance([Geography Point 2])

This will return the distance in meters between the points. If you want miles, divide the result by 1609.344. In my formula, I also round the result to the nearest mile to reflect the fact that this is an approximation and not based on the exact locations.

Using this function, it would be possible to create a query that can return all the addresses that fall within a specified distance of a given location code, based on their ZIP codes. I have not yet created visualizations of this sort of thing, but that is certainly the next step. When I have done it, I'll post a follow-up with the steps for that process, picking up where this one leaves off.
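
As a rough sketch of what such a query could look like (the [Addresses] table and its columns here are made up for illustration, and I'm assuming the lookup table has a [ZIP Code] column to join on):

/* Center point: the geographic center of a chosen ZIP code. */
declare @center geography = (select [GeoLocation] from [dbo].[ZIP Code Lookup] where [ZIP Code] = '38117');

/* All addresses within roughly 25 miles of that center, with the approximate distance. */
select a.[Name], a.[ZIP Code],
 round(l.[GeoLocation].STDistance(@center) / 1609.344, 0) as [Approximate Miles]
from [dbo].[Addresses] as a
inner join [dbo].[ZIP Code Lookup] as l on l.[ZIP Code] = a.[ZIP Code]
where l.[GeoLocation].STDistance(@center) <= 25 * 1609.344;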

[Edit] If you want to know more about the geography data type and its related geometry data type, a good overview can be had here: http://www.jasonfollas.com/blog/archive/2008/03/14/sql-server-2008-spatial-data-part-1.aspx.

Monday, March 7, 2016

Using an Operational DB Key as a "Fact" in Factless Fact Tables

As most of the readers of this blog know from other communication sources, I recently moved from Shelby Systems, Inc. to Custom Data Systems, Inc. One reason for my change is that I now have the opportunity to cut my teeth on a data warehouse project from start to finish, something I have wanted to do ever since learning about data warehousing at my first PASS Summit event back in 2010. At that event I bought the book The Data Warehouse Toolkit by Ralph Kimball and Margy Ross, and I have used it as a resource ever since. Indeed, it has been a constant source of help to me during these last couple of months as I have put the theory of data warehousing into practice at my new employer. On the topic of helpful books, I should also mention Star Schema: The Complete Reference by Christopher Adamson. I highly recommend both books for those just getting started with data warehousing. For those with some experience at star schema modeling, Adamson's book could still be quite helpful, as he presents a large number of practical examples of various challenges in star schema design as well as discussion of approaches quite different from the Kimball method.

For the pure design of the star schema data warehouse database, these resources never led me wrong. However, when it came to the point of implementing analysis on top of the database per se, I found that there was an omission that in retrospect I find quite surprising. It came to my attention as I tried to perform analysis of data from two separate dimensions linked via a so-called "factless" fact table using Power BI Desktop as the platform for the analysis. This problem may be unique to Power BI Desktop, but I doubt it. I believe it is a symptom of basing the output on the "tabular" model of analysis rather than on an analysis cube or on a single query.

The Problem

Simply put, the problem is this: unless a value from the fact table is included in the analysis in some way, the relationship between the dimensions via the intervening fact is ignored. I discovered the problem when I created a bar chart visualization based on the following three tables:

Individuals Dimension
Individuals Key (surrogate key)
Individual ID (operational db key)
First Name
Last Name
Member Status
Billing Classification
etc.

Dates Dimension
Dates Key (surrogate key)
Date
Calendar Year
Calendar Month
etc.

Daily Summary Fact (associates each individual with each date, for point-in-time analysis)
Individuals Key
Dates Key

After adding these three tables into Power BI and relating them together, I dropped [Dates].[Calendar Year] into the bar chart's "Axis" property, and I dropped [Individuals].[Individual ID] into the "Values" property and changed it to "Count Distinct" so that I could see how many members there were each year. I also added [Individuals].[Member Status] to the "Legend" property so that each bar would be subdivided by member type.

I was surprised that this yielded exactly the same numbers for the overall totals and for the shaded areas for every year in the output. That seemed odd even for my imaginary test data, so I started looking into the values in the operational database and in the data warehouse. After several ad hoc queries, I confirmed that the results in the graph were erroneous. And eventually I discovered that the numbers were the overall totals for all the data, not broken out by year. So the ultimate conclusion I came to was that the graph was ignoring the [Daily Summary] fact table that related individuals to dates and was, instead, showing all individuals for all dates.

Two Not-Quite Satisfactory Solutions


The standard approaches for doing counts on factless fact tables such as the [Daily Summary] table are:

  • Apply the COUNT() function to any of the surrogate key values in the fact table.
  • Add a "counter" column with a value of 1 in each row.
While these are fine for a basic count of values, neither works for COUNT DISTINCT. Obviously a "counter" column would return a COUNT DISTINCT value of 1, since every row has the exact same value. And the surrogate key of an individual is not a reliable value for COUNT DISTINCT because the same individual can have multiple surrogate keys whenever the dimension captures changes over time (the so-called "slowly changing dimension"), which the [Individuals] dimension certainly does.

All the examples from the Kimball book and the Adamson book are based on the assumption that the aggregations will be handled in a SQL query with an INNER JOIN between the dimension and "factless" fact tables. And in SQL of course the relationships are honored even if the SELECT clause does not have any columns from the fact table per se. But in Power BI and (probably) other analysis tools using the "tabular model" instead of single sets of query results, this is evidently not the case.

My Solution

The solution that worked for me is to include the operational database identity key directly into the fact table. This is never suggested by Kimball or Adamson (that I found in print anyway), but it resolves the problem by putting the one "dimensional" value that I would want to aggregate with COUNT DISTINCT into the fact table as a column that can be dropped into the Values property of any Power BI visualization. My revised [Daily Summary] fact now looks like this:

Daily Summary Fact
Individuals Key
Dates Key
Individual ID

Using [Daily Summary].[Individual ID] instead of [Individuals].[Individual ID] in my Power BI graph yields the expected and accurate results.
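
In SQL terms, the analysis I wanted from the graph is equivalent to something like the query below (using the table and column names as described above, which may differ from your own); having [Individual ID] on the fact table is what makes the same COUNT DISTINCT possible in the tabular model:

select d.[Calendar Year], i.[Member Status],
 count(distinct f.[Individual ID]) as [Distinct Individuals]
from [Daily Summary Fact] as f
inner join [Dates Dimension] as d on d.[Dates Key] = f.[Dates Key]
inner join [Individuals Dimension] as i on i.[Individuals Key] = f.[Individuals Key]
group by d.[Calendar Year], i.[Member Status]
order by d.[Calendar Year], i.[Member Status];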

Thinking beyond the specific problem of "factless" fact tables, I can anticipate wanting to use COUNT DISTINCT on the [Individual ID] value for a variety of analysis contexts, even when the fact table might in fact contain more traditional facts such as monetary values or other additive quantities. So my current plan is to incorporate the operational key value in every relevant fact table, whether "factless" or not.

Conclusion

Even though literature on data warehousing with the star schema places the identity key of the operational database only in dimensions, I have discovered that it is useful and in some cases necessary to put it in fact tables too. From now on, that value goes in both places in my star schema designs. If you use Power BI Desktop, Excel Power Pivot, or any other self-service BI tool based on the tabular model, you might want to do the same and save yourself some headaches.

Wednesday, September 25, 2013

A First Glance at Prepared Queries or, as I think of them, "Temporary Stored Procedures"

Since my last post I have taken a new position as DBA / Business Intelligence Developer at Shelby Systems, Inc. In my new role I am still doing a lot of report development but also diving into the SQL used by our applications to pull data, looking for ways to improve their efficiency.

Needless to say, my posts here will reflect all the new things I'm learning about T-SQL and SQL Server in general beyond the usual reports-driven information I have acquired in my work. And this week I learned a brand new technique for handling queries: "preparing" them.

In the world of T-SQL the word "prepare" has a very specific definition, which is something like this: "load a query statement into memory and assign it a handle (ID) so that it can be executed repeatedly by means of the handle." Typically "prepared" queries are also "parameterized," meaning that the query requires one or more parameters to be defined with values before the query can return results.

There are four stored procedures to handle "prepared" queries, and the rest of this entry will explain each one.

sp_prepare is the command to use in order to "prepare" a query statement and define one or more parameters for it. It accepts three parameters itself.

The first parameter is a reference to a variable that has already been defined as an integer in an earlier DECLARE statement. This parameter must also be defined as an OUTPUT parameter, meaning that the stored procedure will assign the integer value to the variable.

The second parameter is an nvarchar value of one or more variable definitions. The string value for this parameter should be formatted just like a DECLARE statement but without the word DECLARE.

The third parameter is also an nvarchar value and contains the query to be prepared. Here is an example of using the sp_prepare stored procedure to prepare a query that pulls the members of a family from Shelby.NANames based on a provided FamNu value.

declare @FamilyMembers int;
exec sp_prepare @FamilyMembers output, N'@FamilyNumber int', N'select NameCounter, FirstMiddle, LastName, Birthdate from Shelby.NANames where FamNu = @FamilyNumber';


The @FamilyMembers variable will contain the integer assigned by the sp_prepare procedure as the "handle" for the prepared query. From this point in the SQL script, I can reference @FamilyMembers to execute the prepared query instead of typing it all out again. There is a special stored procedure for executing prepared queries, and that is discussed next.

sp_execute is the command to execute prepared queries. It accepts two parameters. The first parameter is a reference to the handle of the prepared query. Normally this takes the form of the variable used to define the prepared query in the sp_prepare procedure. The second parameter is a varchar list of the value(s) to be used for the parameter(s) in the prepared query. The following two lines of code will execute the prepared query from the example code above for two different families.

exec sp_execute @FamilyMembers, '2';
exec sp_execute @FamilyMembers, '5';

As you can see, after the query has been prepared, it is very easy to execute the query over and over with various parameter values. These two commands (sp_prepare and sp_execute) are so commonly found with the second immediately following the first that Microsoft added a stored procedure to do them both together.

sp_prepexec prepares a query and immediately executes it. It has the same three parameters as sp_prepare but adds a fourth parameter for the initial set of variable values for the execution of the prepared query. The following code shows how the same family member query can be prepared and executed all at once.

exec sp_prepexec @FamilyMembers output, N'@FamilyNumber int', N'select NameCounter, FirstMiddle, LastName, Birthdate from Shelby.NANames where FamNu = @FamilyNumber', '7';

With the advent of sp_prepexec, the sp_prepare stored procedure is hardly ever used. But sp_execute remains as the only way to re-execute a prepared query after the first time. And if you don't plan to execute the query more than once, "preparing" the query is not really worth the effort.

sp_unprepare is the final stored procedure for dealing with prepared queries. As its name states, this one removes a prepared query from memory. It has only one parameter, which is the handle of the prepared query to unprepare. As with sp_execute, this is normally the variable that was defined to capture the handle from the sp_prepare or sp_prepexec stored procedure. The final example is how to "clean up" the memory after the family members prepared query is no longer needed.

exec sp_unprepare @FamilyMembers;

And in case you want to run the whole script all together in SQL Server Management Studio, here is the complete script to prepare, execute, and remove the sample prepared query:

declare @FamilyMembers int;

exec sp_prepexec @FamilyMembers output, N'@FamilyNumber int', N'select NameCounter, FirstMiddle, LastName, Birthdate from Shelby.NANames where FamNu = @FamilyNumber', '7';

exec sp_execute @FamilyMembers, '2';

exec sp_execute @FamilyMembers, '5';

exec sp_unprepare @FamilyMembers;

Although prepared queries are not something I can imagine using for report designs, they would be helpful for applications that need to load the same columns of information over and over from the same table(s), with just one or two parameter values changing each time the query runs. Of course, a permanently defined stored procedure would do just as well, if not slightly better. But if you can't define a permanent stored procedure, "preparing" a frequently-used query is the way to go.
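
For comparison, the permanent stored procedure version of the same family-members query would look something like this (the procedure name is just one I made up):

create procedure [dbo].[GetFamilyMembers]
 @FamilyNumber int
as
select NameCounter, FirstMiddle, LastName, Birthdate
from Shelby.NANames
where FamNu = @FamilyNumber;
go

exec [dbo].[GetFamilyMembers] @FamilyNumber = 2;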
