Fixing Invalid Geography Data

One of my fellow forumites over at the SQL Server Spatial Forum has been having a few headaches when trying to import shapefile data in to the geography datatype, so this post is designed to explain some of the pitfalls you may come across, and how to resolve them. Note that if you are planning on using a lot of complex shape data, I highly recommend using a dedicated tool such as Safe FME - however, I'm going to attempt to show some methods of fixing invalid data using nothing other than SQL Server and Morten Nielsen's popular Shape2Sql loader.

To demonstrate how to import ESRI shapefile data into the geography datatype, you'll need a shapefile loader, and some sample data, so download the following:

Working out the correct datatype and SRID

Unzip the file. It contains 5 files, which each describe different aspects of this dataset. The first step is to work out what spatial reference system the coordinates have been expressed in, which is contained in the tl_2008_us_state.prj file. Load this up in notepad and you will see the following: (line breaks inserted)

  DATUM["D_North_American_1983", SPHEROID["GRS_1980",6378137,298.257222101]],

The GEOGCS keyword at the beginning tells us that this shapefile is defined using geographic coordinates - excellent, so we'll use the geography datatype. But which SRID should we choose? If we look at the other parameters contained in the tl_2008_us_state.prj file, we see that this data is based on the North American 1983 datum, using an Earth radius of 6378137m and an inverse-flattening ratio of 298.257222101. Let's search through the sys.spatial_reference_systems table in SQL Server and see if this matches any of the supported spatial reference systems:

FROM sys.spatial_reference_systems 
WHERE Well_Known_Text LIKE 'GEOGCS%1983%6378137%298.257222101%'
We get one result back tha matches all these parameters - 4269 - so we now know that we can import this data into the geography datatype using the SRID 4269. Let's go!

The problem

Load up Shape2Sql, and select the tl_state_2008.shp. We know that the correct settings for this file are Geography(Spheric), and SRID 4269. We'll call the spatial column geog. Click Upload.

Oh dear, after not very long you get an error message as follows:

Never mind, maybe it's just one rogue state. Click OK to carry on with the rest of the data set...
Oh dear, it wasn't just one rogue state. The total list of states that can't be loaded is as follows:

  • Massachusetts
  • New Mexico
  • Wyoming
  • Montana
  • Texas
  • Alabama
  • North Carolina
  • Arizona
  • Tennessee
  • Utah

The solution

We can't do anything with our data until we get it into SQL Server somehow. The error message tells us that it's not valid geography data, but maybe we can import it into a geometry column instead. As I'm sure you know by now, SQL Server 2008 has two different spatial datatypes - geometry and geography. There are lots of similarities between the two types, and many shapes can be represented using either type. For example, the polygon POLYGON((2 52, 0 52, 0 50, 2 50, 2 52)), represented using SRID 4326, can be used with either the geometry or the geography datatype:

DECLARE @WKT VARCHAR(255) = 'POLYGON((2 52, 0 52, 0 50, 2 50, 2 52))';
SELECT geography::STGeomFromText(@WKT, @SRID)
SELECT geometry::STGeomFromText(@WKT, @SRID)

However, some shapes can only be represented using the geometry datatype. This includes not only shapes defined using a projected coordinate system, but 'invalid' shapes (such as a polygon with a self-intersecting outer ring), or shapes larger than a hemisphere. So, although Shape2Sql can't import our data directly into the geography datatype, maybe it can import it into the geometry datatype - we could then try to use some of the methods provided by SQL Server 2008 to analyse and make the data suitable for the geography datatype. To find out, change the geometry properties setting on Shape2Sql to "planar geometry" (keep the same SRID) and change the column name to be geom. Then reimport the data (overwriting the existing, partial table). Brilliant, every row imports successfully.

We now have a table that contains a geometry column. What we're going to try to do is fix the data in that column before using it to define a geography column representing the same features. So let's add a geography column to the table

ALTER TABLE [tl_2008_us_state]
ADD geog geography

When we first tried to use Shape2Sql to import geography data, some of the records were imported successfully, so let's move those across to the geography column to start with

UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STAsBinary(), geom.STSrid)
WHERE NAME NOT IN('Massachusetts', 'New Mexico', 'Wyoming', 'Montana', 'Texas', 'Alabama', 'North Carolina', 'Arizona', 'Tennessee', 'Utah');
(46 ROW(s) affected)

Ok, now let's get to work on the non-valid instances.

Part 1. Fixing any invalid geometries with MakeValid()

The geometry datatype allows "invalid" geometries, whereas the geography datatype was not. So, let's see if that was the reason why any of our records could not be imported, by using the STIsValid() method:

WHERE NAME IN('Massachusetts', 'New Mexico', 'Wyoming', 'Montana', 'Texas', 'Alabama', 'North Carolina', 'Arizona', 'Tennessee', 'Utah');

It seems that only Alabama is invalid, so let's fix that one by using MakeValid() on it and trying to import it into the geography column again.

UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.MakeValid().STAsBinary(), geom.STSrid)
WHERE NAME = 'Alabama'
-- (1 row(s) affected)

So that's Alabama sorted. Only 9 to go.

Part 2. Ensuring the correct ring orientation with Spatial Ed's trick

Another important criteria for the geography datatype is that the points that define polygon rings must have the correct ring orientation. In simplistic terms, if you have the wrong ring orientation then your polygon will be "inside-out" - the area that you thought was contained inside the polygon will be the exterior, and the area that you thought was on the outside will be inside the polygon. Fortunately, there's an easy trick to fixing this, which is explained on Ed Katibah's blog. So let's give it a try. Using the brute force approach of trying each remaining state in turn, we find that Massachusetts can be resolved this way:

UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STUnion(geom.STStartPoint()).STAsBinary(), geom.STSrid)
WHERE NAME = 'Massachusetts'
-- (1 row(s) affected)

So that's Massachusetts sorted. Only 8 to go.

Part 3. Buffer()-ing and UnBuffer()-ing

We've made sure that our polygons are valid (at least, in the geometry datatype they're valid), and they have the correct ring orientation, so what else could be wrong?

The tl_2008_us_state shapefile is very detailed - the outline of each state is defined by tens of thousands of points, which are very close together. When calculated on the planar surface of the geometry datatype,we know that the line drawn between these points does not intersect itself (if it did, then STIsValid() would result in 0). However, maybe when you connect the points on the ellipsoidal surface of the geography datatype, SQL Server calculates that the line actually intersects itself, which would result in an invalid geography instance. We need some way of smoothing the points while not changing the overall shape of each state. One approach is to buffer (i.e. expand) each shape by a very small amount and then buffer it again using a negative value to contract it again. This has the effect of rounding corners, which can resolve problems caused by sharp "spikes" in the geometry. When we try this approach using a buffer of 0.00001, we find it works for four of the remaining states, as follows:

UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STBuffer(0.00001).STBuffer(-0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'New Mexico'
-- (1 row(s) affected)
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STBuffer(0.00001).STBuffer(-0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'Wyoming'
-- (1 row(s) affected)
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STBuffer(0.00001).STBuffer(-0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'North Carolina'
-- (1 row(s) affected)
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.STBuffer(0.00001).STBuffer(-0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'Tennessee'

That leaves Texas, Montana, Arizona, and Utah

Part 4. Reduce()

We have one more tool left in our SQL Server 2008 arsenal for fixing invalid geography data - the Reduce() method. Like the Buffer()/Unbuffer() approach, this will actually result in changing the points in the instance (or, specifically, removing them). There are many cases where you might want to simplify your polygon instances using Reduce() - to make your queries more efficient, for example. However, in this case, I'm going to try to reduce the instances by the smallest amount possible, supplying the minimum tolerance I can just to make the instances valid. Using this approach, I can fix the last four polygons as follows:

UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.Reduce(0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'Texas'
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.Reduce(0.000001).STAsBinary(), geom.STSrid)
WHERE NAME = 'Montana'
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.Reduce(0.00001).STAsBinary(), geom.STSrid)
WHERE NAME = 'Arizona'
UPDATE tl_2008_us_state
SET geog = geography::STGeomFromWKB(geom.Reduce(0.000001).STAsBinary(), geom.STSrid)

The Result

That's it! Using one method or another, we have been able to "fix" all of the geometry rows to be valid for the geography datatype. To test the geography data, try this:


and the results...

Spatial Results tab showing geography polygons from the tl_2008_us_state file


Fixing Invalid Geography Data

Great article again ! (All your info is so 'to the point', pun intended).

But I have another situation that you may have come up against.
USA ZIP codes; Arizona in particular.
I inported our Map Info ZIPs using FWTools to shape files. I imported the shape file to SQL 2008 using Shape2SQL. No problem so far. Some of the shapes (ZIPS) are multipolygons and one set of polygons have single parenthesis "(" around the points , not double "((". Like so: I've cut down the points for this example.

MULTIPOLYGON (((-111.94461300969124 33.378685995936394, -111.94431000947952 33.378695994615555, -111.9442940056324 33.378712996840477, -111.94428899884224 33.378685995936394)),
((-111.97762799263 33.376549005508423, -111.97736901044846
33.376693993806839, -111.97563698887825 33.376638993620872, -111.97545799612999 33.376552000641823, -111.97762799263 33.376549005508423)),
((-111.92428800463677 33.349101006984711, -111.92431101202965 33.349153995513916, -111.92432600259781 33.349227994680405, -111.92305099964142 111.92584699392319 33.349236994981766, -111.92580598592758 33.349105000495911, -111.92428800463677 33.349101006984711),
-111.95171898603439 33.38068999350071, -111.95177599787712 33.380695998668671, -111.95183399319649 33.380731999874115, -111.95131701231003 33.380691006779671, -111.95171898603439 33.38068999350071)))

The 2nd poly and the 3rd poly have single ") (" only.

So I 'fixed' it by putting the second "),(" around the data. But it won't display the shape on the map. If I leave out the last poly, it works OK.

So my questions are: Have you come across this single paran problem before ? If so, do you have any suggestions?

Thanks too for your book. Excellent ! And by the way, I'm using your 'handler.ashx' framework for parsing out this data and constructing the java script code.



Well explained! Keep it up.

Well explained! Keep it up.

Shape2SQL updated

Thanks for a great blogpost! I updated my tool so it will try and correct the features automatically using the approaches you describe here. The shapefile you link to now loads straight in.

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options