A technical blog about my projects, challenges, and discoveries in the world of data warehousing using SQL Server, Power BI Desktop, DevExpress, and more.

Wednesday, March 30, 2016

Geocoding on the Cheap (i.e. Free)

It has been a long time since I've been so excited about sharing a database technique on my blog as I am about this one. Geocoding enables a wide range of geographical reporting, and the maps are almost always the most whiz-bang part of any reporting demonstration. But the data behind the flashy maps has been hard to come by, and it is usually considered too pricey for most small-to-medium-sized database users, because in most situations geocoding individual addresses requires a fee-based license agreement with Google, Bing, or some other source of geocoding data.

But I recently learned that if you are willing to compromise the precision of the data just a little and use approximate locations for your addresses instead of exact ones, you can do geocoding at no cost at all. That's right: free. The "trick" is that you have to be okay with using the geographic center of the ZIP code area as the location of each address in that ZIP code. According to http://www.mapszipcode.com the average ZIP code area is 90 square miles, with some much smaller than that and a few larger. If my math is correct (treating that average ZIP code as a square roughly 9.5 miles on a side), the largest difference between an actual address and the center of its ZIP code is about 6.7 miles, the distance from the center to a corner. Most differences will be much smaller than that.

If that margin of error is within your tolerance for geographical reporting, read on.

Step One: Download a Free Data Source of ZIP Code Latitude and Longitude Information


Learning that there are free data sources of ZIP code lat/long information was the "aha" moment that opened up in my mind all the possibilities of doing geocoded reporting without expensive licensing agreements. I did some searching for the best source and found the following links, which I provide for your review. They contain good tips, and I won't waste space repeating what they say.

http://www.uszipcodes.com/zip-code-latitude-longitude-quality.htm
http://greatdata.com/resources
https://boutell.com/zipcodes/

The last one listed is the site from which I chose to download the data. It is in a simple comma-delimited file format, easy to work with.

Step Two: Prepare the Data Source for Import into SQL Server


What you have to do in this step is determined by which data source file you downloaded in the previous step and the method of importing the data you plan to use. The data from boutell.com is quote-and-comma delimited, which is not a format easily handled by the bulk import utility I used for the import. It also contains two time-zone columns that are not relevant to the geocoding process. So I opened the source file in Excel, deleted the two columns of time-zone data, and then saved the file under a new name in basic CSV format.

Be aware that a handful of cities use an apostrophe in the city name. The bulk insert process I used handles that fine, but if you use a different method, you may need to locate those cities and either remove them or deal with them in a special way.

Step Three: Prepare the Database for the Import


This might be the trickiest step for some. To use the same bulk import utility that I did, you will need to create a brand-new table to receive the data. If you use a different method for the import, you might not need to do that. If you do make a table, it might be a temporary one (if you have another, more permanent table where the geocoding data will ultimately land) or it might be a permanent addition to your database. How you name and use the table will depend on whether it is temporary or permanent. In my production environment, this was just a temporary table. Here is the script I used to create the table:

create table zcta (
 zip char(5) primary key,
 city varchar(64),
 state char(2),
 latitude float,
 longitude float );


Step Four: Import the Data


I chose to use a bulk insert routine executed from SQL Server Management Studio. Here is a blog post from another SQL Server user showing how simple the process is:

http://blog.sqlauthority.com/2012/06/20/sql-server-importing-csv-file-into-database-sql-in-sixty-seconds-018-video/.

Because of the data clean-up in Step Two, my process was pretty much identical to the one shown in the video, except for the file location and file name.
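
For reference, here is roughly what that bulk insert looks like against the zcta table created in Step Three. This is only a sketch: the file path is a placeholder, and the FIRSTROW setting assumes the cleaned-up CSV still has a header row.

bulk insert zcta
from 'C:\Import\zipcodes.csv'   -- placeholder path to the CSV prepared in Step Two
with (
 fieldterminator = ',',         -- basic CSV, as saved from Excel
 rowterminator = '\n',
 firstrow = 2 );                -- assumes the first row contains column headers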

Now that you have every meaningful ZIP code in your database along with its latitude and longitude values, you can connect that lat/long information to any address in your database that has a ZIP code. Getting to this point is the major hurdle in geocoding; once you are here, the rest is easy.
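
As a rough sketch, that connection is just a join on the ZIP code. The [Addresses] table and its [Address ID] and [Postal Code] columns below are hypothetical stand-ins for whatever address table actually exists in your database:

select a.[Address ID],          -- hypothetical address key column
       z.latitude,
       z.longitude
from dbo.[Addresses] a          -- hypothetical address table
join dbo.zcta z
  on z.zip = left(a.[Postal Code], 5);   -- match on the five-digit ZIP code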

Step Five: Add a Geography Column


The lat/long information is essential, but it is not in a format that SQL Server can readily use for calculating distances. There is a special data type called "geography" that is specifically designed for this purpose. There is a lot you could learn about the geography data type, but here's all you need to know:

  • The "geography" data type was introduced in SQL Server 2008.
  • There are sub-types for "geography" values, but the only one you need is POINT.
  • A geography "point" is defined by longitude and latitude (in that order).
  • Because the geography data type is implemented as a SQL CLR type, its methods and properties are case sensitive.
  • The easiest syntax for populating a geography point is this: geography::Parse('POINT (long lat)')
  • The geography data type can be declared for table columns, view columns, and SQL variables.
Note that in that syntax example, long and lat should be replaced by string versions of the longitude and latitude values, respectively.
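
Here is a minimal snippet showing those rules in action; the coordinates are arbitrary sample values:

declare @point geography;
set @point = geography::Parse('POINT(-86.78 36.16)');    -- longitude first, then latitude
select @point.Lat as Latitude, @point.Long as Longitude; -- Lat and Long are case sensitive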

For my purposes, I wanted the geography data to be calculated for each ZIP code just once, so I added a computed column onto the permanent version of the ZIP code lookup table. Here is the code I used:

alter table [dbo].[ZIP Code Lookup] add [GeoLocation] as
 geography::Parse('POINT(' + convert(varchar(20), [Longitude]) + ' ' + convert(varchar(20), [Latitude]) + ')')
 persisted

The longitude and latitude values are stored as float in the table, so they have to be converted to varchar in order to be concatenated into the geography function. The "persisted" keyword at the end means that the value is calculated just once and then stored; it will only be recalculated if the longitude or latitude value changes in the lookup table.

Step Six: Use the Geography Column to Calculate Distance


Here is the payoff of all the work above. With the geography column representing the lat/long position of the address ZIP codes, you can calculate approximate distances between any two addresses with the STDistance function, just like this:

[Geography Point 1].STDistance([Geography Point 2])

This will return the distance in meters between the points. If you want miles, divide the result by 1609.344. In my formula, I also round the result to the nearest mile to reflect the fact that this is an approximation and not based on the exact locations.
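
As a concrete sketch, here is how the approximate distance in miles between two ZIP codes could be calculated from the lookup table built above. The [ZIP Code] column name is an assumption (substitute whatever your lookup table actually uses), and the ZIP values are arbitrary examples:

declare @zip1 geography, @zip2 geography;
select @zip1 = [GeoLocation] from [dbo].[ZIP Code Lookup] where [ZIP Code] = '37203';
select @zip2 = [GeoLocation] from [dbo].[ZIP Code Lookup] where [ZIP Code] = '90210';
select round(@zip1.STDistance(@zip2) / 1609.344, 0) as [Approx Miles];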

Using this function, it would be possible to create a query that returns all the addresses falling within a specified distance of a given location, based on their ZIP codes. I have not yet created visualizations of this sort of thing, but that is certainly the next step. When I have done it, I'll post a follow-up with the steps for that process, picking up where this one leaves off.
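
In the meantime, here is a sketch of what such a query could look like, listing every ZIP code whose center falls within 25 miles of a given ZIP code; joining your own address table to the result (not shown) would turn it into a list of addresses. As above, the [ZIP Code] column name is an assumption:

declare @origin geography;
select @origin = [GeoLocation] from [dbo].[ZIP Code Lookup] where [ZIP Code] = '37203';

select [ZIP Code],
       round(@origin.STDistance([GeoLocation]) / 1609.344, 0) as [Approx Miles]
from [dbo].[ZIP Code Lookup]
where @origin.STDistance([GeoLocation]) <= 25 * 1609.344   -- 25 miles expressed in meters
order by [Approx Miles];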

[Edit] If you want to know more about the geography data type and its related geometry data type, a good overview can be had here: http://www.jasonfollas.com/blog/archive/2008/03/14/sql-server-2008-spatial-data-part-1.aspx.

Monday, March 7, 2016

Using an Operational DB Key as a "Fact" in Factless Fact Tables

As most of the readers of this blog know from other communication sources, I recently moved from Shelby Systems, Inc. to Custom Data Systems, Inc. One reason for my change is that I now have the opportunity to cut my teeth on a data warehouse project from start to finish, something I have wanted to do ever since learning about data warehousing at my first PASS Summit event back in 2010. At that event I bought the book The Data Warehouse Toolkit by Ralph Kimball and Margy Ross, and I have used it as a resource ever since. Indeed, it has been a constant source of help to me during these last couple of months as I have put the theory of data warehousing into practice at my new employer. On the topic of helpful books, I should also mention Star Schema: The Complete Reference by Christopher Adamson. I highly recommend both books for those just getting started with data warehousing. For those with some experience at star schema modeling, Adamson's book could still be quite helpful, as he presents a large number of practical examples of various challenges in star schema design, as well as discussions of approaches quite different from the Kimball method.

For the pure design of the star schema data warehouse database, these resources never led me wrong. However, when it came to implementing analysis on top of the database, I found an omission that in retrospect is quite surprising. It came to my attention as I tried to analyze data from two separate dimensions linked via a so-called "factless" fact table, using Power BI Desktop as the platform for the analysis. This problem may be unique to Power BI Desktop, but I doubt it; I believe it is a symptom of basing the output on the "tabular" model of analysis rather than on an analysis cube or on a single query.

The Problem

Simply put, the problem is this: unless a value from the fact table is included in the analysis in some way, the relationship between the dimensions via the intervening fact is ignored. I discovered the problem when I created a bar chart visualization based on the following three tables:

Individuals Dimension
Individuals Key (surrogate key)
Individual ID (operational db key)
First Name
Last Name
Member Status
Billing Classification
etc.

Dates Dimension
Dates Key (surrogate key)
Date
Calendar Year
Calendar Month
etc.

Daily Summary Fact (associates each individual with each date, for point-in-time analysis)
Individuals Key
Dates Key

After adding these three tables into Power BI and relating them together, I dropped [Dates].[Calendar Year] into the bar chart's "Axis" property, and I dropped [Individuals].[Individual ID] into the "Values" property and changed it to "Count Distinct" so that I could see how many members there were each year. I also added [Individuals].[Member Status] to the "Legend" property so that each bar would be subdivided by member type.

I was surprised that this yielded exactly the same numbers for the overall totals and for the shaded areas for every year in the output. That seemed odd even for my imaginary test data, so I started looking into the values in the operational database and in the data warehouse. After several ad hoc queries, I confirmed that the results in the graph were erroneous. And eventually I discovered that the numbers were the overall totals for all the data, not broken out by year. So the ultimate conclusion I came to was that the graph was ignoring the [Daily Summary] fact table that related individuals to dates and was, instead, showing all individuals for all dates.

Two Not-Quite Satisfactory Solutions


The standard approaches for doing counts on factless fact tables such as the [Daily Summary] table are:

  • Apply the COUNT() function to any of the surrogate key values in the fact table.
  • Add a "counter" column with a value of 1 in each row.
While these are fine for a basic count of rows, neither works for COUNT DISTINCT. Obviously a "counter" column would return a COUNT DISTINCT value of 1, since every row has the exact same value. And the surrogate key of an individual is not a reliable value for COUNT DISTINCT, because the same individual can have multiple surrogate keys whenever the dimension captures changes over time (making it a so-called "slowly changing dimension"), which the [Individuals] dimension certainly does.

All the examples in the Kimball and Adamson books assume that the aggregations will be handled in a SQL query with an INNER JOIN between the dimension and "factless" fact tables. In SQL, of course, the relationships are honored even if the SELECT clause does not include any columns from the fact table itself. But in Power BI and (probably) other analysis tools that use the "tabular model" instead of a single set of query results, this is evidently not the case.
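
In other words, the books assume a query shaped something like the following (using the table and column names from the example above), where the join through the fact table guarantees that only individuals actually associated with each date are counted:

select d.[Calendar Year],
       i.[Member Status],
       count(distinct i.[Individual ID]) as [Member Count]
from [Daily Summary] f
join [Individuals] i on i.[Individuals Key] = f.[Individuals Key]
join [Dates] d on d.[Dates Key] = f.[Dates Key]
group by d.[Calendar Year], i.[Member Status];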

My Solution

The solution that worked for me is to include the operational database identity key directly in the fact table. This is never suggested by Kimball or Adamson (at least not that I found in print), but it resolves the problem by putting the one "dimensional" value that I would want to aggregate with COUNT DISTINCT into the fact table, as a column that can be dropped into the Values property of any Power BI visualization. My revised [Daily Summary] fact now looks like this:

Daily Summary Fact
Individuals Key
Dates Key
Individual ID

Using [Daily Summary].[Individual ID] instead of [Individuals].[Individual ID] in my Power BI graph yields the expected and accurate results.
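
For what it is worth, here is a minimal sketch of the revised fact table's DDL, assuming integer surrogate keys and an integer operational key; adjust the data types and constraints to match your own dimensions:

create table [Daily Summary] (
 [Individuals Key] int not null,   -- surrogate key referencing the Individuals dimension
 [Dates Key] int not null,         -- surrogate key referencing the Dates dimension
 [Individual ID] int not null );   -- operational db key, suitable for COUNT DISTINCT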

Thinking beyond the specific problem of "factless" fact tables, I can anticipate wanting to use COUNT DISTINCT on the [Individual ID] value for a variety of analysis contexts, even when the fact table might in fact contain more traditional facts such as monetary values or other additive quantities. So my current plan is to incorporate the operational key value in every relevant fact table, whether "factless" or not.

Conclusion

Even though literature on data warehousing with the star schema places the identity key of the operational database only in dimensions, I have discovered that it is useful and in some cases necessary to put it in fact tables too. From now on, that value goes in both places in my star schema designs. If you use Power BI Desktop, Excel Power Pivot, or any other self-service BI tool based on the tabular model, you might want to do the same and save yourself some headaches.
