A technical blog about my projects, challenges, and discoveries in the world of data warehousing using SQL Server, Power BI Desktop, DevExpress, and more.

Wednesday, March 30, 2016

Geocoding on the Cheap (i.e. Free)

It has been a long time since I've been so excited about sharing a database technique on my blog as I am about this one. Geocoding enables a wide range of geographical reporting that is almost always the most whiz-bang part of any demonstrations of reporting output. But the data behind the flashy maps has been hard to come by, and it is usually considered too pricey for most small-to-medium sized users of databases because in most situations geocoding individual addresses requires a fee-based license agreement with Google, Bing, or some other source of geocoding information.

But I recently learned that if you are willing to compromise the precision of the data just a little and use approximate locations of your address information instead of the exact locations, you can do geocoding at completely no cost. That's right: free. The "trick" is that you have to be okay with using the geographic center of the ZIP code area as the location of each address in that ZIP code. According to http://www.mapszipcode.com the average ZIP code area is 90 square miles, with some much smaller than that and a few larger. If my math is correct, that means that the largest difference between the actual address and the center of the ZIP code (on average) is 6.7 miles. Most differences will be closer than that.

If that margin of error is within your tolerance for geographical reporting, read on.

Step One: Download a Free Data Source of ZIP Code Latitude and Longitude Information

Learning that there are free data sources of ZIP code lat/long information was the "aha" moment that opened up in my mind all the possibilities of doing geocoding reporting without expensive licensing agreements. I did some searching to find the best source and found the following links, which I provide for your review. There are good tips on these, and I won't waste space repeating what they say.


The last one listed is the site from which I chose to download the data. It is in a simple comma-delimited file format, easy to work with.

Step Two: Prepare the Data Source for Import into SQL Server

What you have to do in this step will be determined by which data source file you download in the previous step and the method of importing the data you plan to use. The data from boutell.com is quote-and-comma delimited, which is not a format easily handled by the bulk import utility I used for the import. Also, it contains two time-zone columns that are not relevant to the geocoding process. So I opened up the source file in Excel, deleted the two columns of time-zone data, and then saved the file with a new name and in a basic CSV file format.

Be aware that there are a handful of cities that use an apostrophe in the name of the city. The bulk insert process I used handles that fine, but if you use a different method, you may need to locate those cities and either remove them or deal with them in a special way.

Step Three: Prepare the Database for the Import

This might be the trickiest step for some. To use the same bulk import utility that I did, you will need to create a brand new table to receive the data. If you use a different method for the import, you might not need to do that. If you make a table, it might be a temporary table (if you have another, more permanent table where the geocoding data will ultimately land) or it might be a permanent addition to your database. How you name and use the table will depend on whether it is temporary or permanent. In my production environment, this was just a temporary table.  Here is the script I used to make the table:

create table zcta (
 zip char(5) primary key,
 city varchar(64),
 state char(2),
 latitude float,
 longitude float );

Step Four: Import the Data

I chose to use a bulk insert routine executed from SQL Server Management Studio. Here is a blog post from another SQL Server user showing how simple the process is:


Because of the data clean-up in Step Two, my process was pretty much identical to the one shown in the video, except for the file location and file name.

Now that you have every meaningful ZIP code in your database along with their latitude and longitude values, you can connect that lat/long information with the addresses in your database that have a ZIP code. Getting to this point is the major hurdle to geocoding. Once you get to this point, the rest is easy.

Step Five: Add a Geography Column

The lat/long information is essential, but it is not in a format that SQL Server can readily use for calculating distances. There is a special data type called "geography" that is specifically designed for this purpose. There is a lot you could learn about the geography data type, but here's all you need to know:

  • The "geography" data type was introduced in SQL Server 2008.
  • There are sub-types for "geography" values, but the only one you need is POINT.
  • A geography "point" is defined by longitude and latitude (in that order).
  • As a SQL CLR function, the functions related to the geography data type are case sensitive.
  • The easiest syntax for populating a geography point is this: geography::Parse('POINT (long lat)')
  • The geography data type can be declared for table columns, view columns, and SQL variables.
Note in that syntax example, Long and Lat should be replaced by string versions of the longitude and latitude values, respectively.

For my purposes, I wanted the geography data to be calculated for each ZIP code just once, so I added a computed column onto the permanent version of the ZIP code lookup table. Here is the code I used:

alter table [dbo].[ZIP Code Lookup] add [GeoLocation] as geography::Parse('POINT(' + convert(varchar(20), [Longitude]) + ' ' + convert(varchar(20), [Latitude]) + ')') persisted

The longitude and latitude values are stored in the float data type in the table, so they have to be converted to varchar in order to be concatenated into the geography function. The "persisted" key word at the end means that the value is calculated just once and then stored semi-permanently. It will only be re-calculated if either the longitude or latitude value changes in the lookup table.

Step Six: Use the Geography Column to Calculate Distance

Here is the payoff of all the work above. With the geography column representing the lat/long position of the address ZIP codes, you can calculate approximate distances between any two addresses with the STDistance function, just like this:

[Geography Point 1].STDistance([Geography Point 2])

This will return the distance in meters between the points. If you want miles, divide the result by 1609.344. In my formula, I also round the result to the nearest mile to reflect the fact that this is an approximation and not based on the exact locations.

Using this function, it would be possible to create a query that can return all the addresses that fall within a specified distance of a given location code, based on their ZIP codes. I have not yet created visualizations of this sort of thing, but that is certainly the next step. When I have done it, I'll post a follow-up with the steps for that process, picking up where this one leaves off.

[Edit] If you want to know more about the geography data type and its related geometry data type, a good overview can be had here: http://www.jasonfollas.com/blog/archive/2008/03/14/sql-server-2008-spatial-data-part-1.aspx.

No comments:

Post a Comment