Taking Advantage of Today's 'Smarter' GIS Data Formats: A Graphic Tour of the New Curvefitter Extension for FME

Author's note: I wrote this article as a contractor for TCI Software.

The State of Geodata
Users expect a lot from GIS data. The data sets are supposed to enable pretty maps with straight lines and smooth curves. They are supposed to be small and compact for easy storage and sharing. And they are supposed to be accurate in representing what's on the ground.

Those are lofty and sometimes incompatible goals. For years those developing GIS and other types of representational software have tried to balance those demands with others, including the speed of processing. One of the choices some software developers made was not to store curved geometry in mathematical formulae, but rather as "stroked" arcs. Stroking refers to representing curves with many short, straight segments. This storage format sped up some types of calculations. As a result of this decision, there's quite a lot of mapping data in use today in a variety of formats that include stroked arcs. These data sets can be the cause of poor aesthetics in print and online maps (including sharp points and jagged lines), large files (in some cases hundreds of percent larger) and representations that do not match curves in the real world (such as highways and streams).
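To make the idea of stroking concrete, here is a minimal Python sketch. It is only an illustration and not code from any GIS product; the function names stroke_arc and max_chord_error are made up for this example. It breaks a circular arc into straight chords and reports the largest gap between the true arc and its stroked stand-in.

```python
import math


def stroke_arc(cx, cy, radius, start_deg, end_deg, segments):
    """Approximate a circular arc with `segments` straight chords.

    Returns the list of (x, y) vertices, both endpoints included.
    (Hypothetical helper written for this illustration.)
    """
    start = math.radians(start_deg)
    sweep = math.radians(end_deg - start_deg)
    return [
        (cx + radius * math.cos(start + sweep * i / segments),
         cy + radius * math.sin(start + sweep * i / segments))
        for i in range(segments + 1)
    ]


def max_chord_error(radius, sweep_deg, segments):
    """Largest gap between the true arc and its stroked version.

    Each chord spans an angle theta; the gap at the chord's midpoint
    is radius * (1 - cos(theta / 2)).
    """
    theta = math.radians(sweep_deg) / segments
    return radius * (1.0 - math.cos(theta / 2.0))


if __name__ == "__main__":
    # A quarter circle of radius 100, stroked at three densities.
    for n in (4, 16, 64):
        vertices = stroke_arc(0.0, 0.0, 100.0, 0.0, 90.0, n)
        print(n, "segments,", len(vertices), "vertices, max gap",
              round(max_chord_error(100.0, 90.0, n), 4))
```

The trade-off is visible in the numbers: a closer match to the true curve requires more vertices, while a mathematical arc needs only a handful of values (for example a center, a radius and two angles) no matter how closely the viewer zooms in.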

The good news is that data formats and software products are maturing. Today more software products than ever can analyze datasets in formats that store mathematical curves. What software products and what formats? Most CAD formats have been able to store and use mathematical arcs since they were introduced, as far back as the 1980s. AutoCAD, MicroStation (and many of the GIS products built on them) and other CAD packages store and manipulate entities using curves. Oracle Spatial stores curves; so, too, do Intergraph's GeoMedia Warehouses (Access/SQL Server), Autodesk MapGuide's SDF3 and GeoConcept's native format. ESRI has updated its key storage format, the geodatabase, to support mathematical curves as well.

This is great news for users of these packages as they create new datasets because they can store data directly in these compact data formats. These formats also provide an opportunity to optimize stroked curves in legacy data. Optimize? That's right, this is the time to make the best choices about how to turn those stroked lines, which may represent linear features or be part of complex polygons, into mathematical curves that meet data user needs. How? By using the new Curvefitter extension to Safe Software's FME 2007.

TCI Software developed Curvefit 15 years ago as an AutoCAD add-on to address the "stroking" challenge, primarily encountered when legacy data was imported. When Safe Software completed the move to what it terms Rich Geometry, which includes support for mathematical curves in FME 2007, it was time to port Curvefit into the Curvefitter extension to FME. The extension adds a new Curvefitter Transformer to the product's long list of transformers.

There is one caveat in using Curvefitter worth noting before going any further: there's no advantage to optimizing geodata with curves if their ultimate destination is a format that can't store curves! (Shapefiles are one example.) But for nearly all other stroked datasets, there will be benefits to Curvefitter data optimization that can be measured aesthetically, in size and in accuracy.

How Curvefitter Works
Simply put, Curvefitter examines long lines made up of many segments (called polylines in AutoCAD and linestrings, among other names, in other programs) and determines a "best" combination of curved and straight segments to represent them. To do that, the software takes into account three key goals: compression (how important is file size reduction?), smoothness (how important is overall smoothness of the line?) and accuracy (how important is it for the output to match the input?). Casual users can use default settings and Curvefitter's built-in fuzzy logic will balance the three. Those who need more control for a specific goal can set parameters one at a time, giving each one a different weight (Figure 1).

The user also sets two parameters that control the nature of the output lines and curves. Precision is the most important and defines how far an existing vertex in the dataset can be from the linework in the output dataset. Keep the value small (in units of the data) and the resulting line will follow the existing vertices closely. Make the value larger and the resulting line will be allowed to run above or below the original polyline, up to the amount specified in precision. Flattening determines when relatively flat curves are replaced with straight lines. A curve with a mid-ordinate (a measure of curvature) below this value will be turned into a straight line.
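The roles of the two parameters can be sketched in a few lines of Python. This is not Curvefitter's actual algorithm, which the article does not describe; it is a simplified illustration under stated assumptions. It fits one circle through the first, middle and last vertices of a run, tests every vertex against precision, and then compares the arc's mid-ordinate to flattening. All function names are hypothetical, and the sketch assumes the run sweeps less than a half circle.

```python
import math


def circle_through(p1, p2, p3):
    """Return (cx, cy, r) for the circle through three points, or None if collinear."""
    ax, ay = p1
    bx, by = p2
    cx, cy = p3
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, math.hypot(ax - ux, ay - uy)


def mid_ordinate(radius, chord):
    """Distance from the midpoint of a chord to the (minor) arc it subtends."""
    half = chord / 2.0
    return radius - math.sqrt(max(0.0, radius * radius - half * half))


def classify_run(vertices, precision, flattening):
    """Label a run of vertices 'arc', 'line' or 'reject' (illustrative only).

    'reject' means a single arc cannot stay within `precision` of every
    original vertex, so a real tool would have to split the run.
    """
    first, middle, last = vertices[0], vertices[len(vertices) // 2], vertices[-1]
    circle = circle_through(first, middle, last)
    if circle is None:
        return "line"  # the three sample points are collinear
    cx, cy, r = circle
    # Precision: no original vertex may sit farther than this from the arc.
    worst = max(abs(math.hypot(x - cx, y - cy) - r) for x, y in vertices)
    if worst > precision:
        return "reject"
    # Flattening: a nearly flat arc is replaced by a straight segment.
    if mid_ordinate(r, math.dist(first, last)) < flattening:
        return "line"
    return "arc"
```

For ten vertices sampled along a quarter circle of radius 10, the mid-ordinate is about 2.9 units, so the run is labeled an arc with a flattening of 0.5 and a line with a flattening of 5.0.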

Figure 1: Dialog box for setting Curvefitter Parameters.

Putting Curvefitter Through its Paces
In the figure below, the line in the middle is defined by 59 vertices (shown on top). If you counted, you'd find that 58 short segments make up the line. Curvefitter optimized the line by representing it using just three vertices, that is, just two curved segments (bottom). The black squares show their start and end points. You'll note that the resultant line closely, but not exactly, follows the original vertices.

Figure 2: Stroked line (middle), vertices (top) and Curvefitter-optimized line with just two arcs (bottom).

The real power comes when Curvefitter tackles large "real life"-sized datasets. Let's start with some parcel data. Grays Harbor County in Washington State offers its data for free on the Internet (and provided permission to use it for this article). A 6.1 MB subset was extracted as a shapefile (Figure 3a). That was converted into a DWG file (3.8 MB), MapGuide SDF (4.5 MB), ESRI personal geodatabase (5.4 MB) and file geodatabase (available in ArcGIS 9.2, 1.89 MB) using Safe's Feature Manipulation Engine (FME) core tools.

The next step was to run Curvefitter. FME and non-FME users can tease out the process in this FME workspace (Figure 3). A workspace allows FME users to create, edit and save such procedures either to run in the future or to embed in other workspaces.

Figure 3: The FME Workspace used in the examples in this article.

The Curvefitter precision parameter for this example was set very tight, at 0.1, which in this case means 0.1 feet. (The value is always in the native units of the data.) In plain English, that means that each newly created vertex can be no more than 1/10 of a foot from the original line it represents. Said another way, the vertices are on a tight leash and must stay "very close" to the original linework.

It's also worth noting that Curvefitter can take advantage of existing FME tools to maintain shared boundaries (or not) after optimization depending on user need. For parcels and other data sets with adjacent polygons, it's most likely that users will want shared boundaries to be maintained.

Zooming in on the original data (Figure 3b) you can see there are many, many vertices making up the curved sides of the parcels. After running Curvefitter (Figure 3c), each parcel's curved sides are saved as single arcs, with a few exceptions.

Figure 3 a) Raw data from Grays Harbor County, Washington.

Figure 3 b) many vertices that make up the original data.

Figure 3 c) fewer vertices and true curves after Curvefitter processing.

What does that mean for file statistics? The AutoCAD DWG shrank 137% to 1.6 MB. The SDF3 file shrank 181% to 1.6 MB. The ESRI personal geodatabase shrank 12.5% to 4.8 MB. ESRI's new ArcGIS 9.2 file geodatabase shrank 77.5% to 1.09 MB.

The reduction is calculated using the following formula: % reduction = (Original Size - New Size) / New Size x 100. Or in English, the value is the ratio of what was removed to what is left, times 100. A 300% reduction means three parts removed for one part left; the resulting file is 1/4 the size of the original. A 200% reduction would be two parts removed for one part left; the resulting file is 1/3 the size of the original. Table 1 includes all of the "before and after" values from each example in this article.
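Because this formula divides by the new size rather than the original size, values above 100% are possible. The short Python sketch below works through it; the function names are invented for this illustration, and the second convention (dividing by the original size) is included only for comparison and is not used elsewhere in this article.

```python
def percent_reduction(original_size, new_size):
    """The article's formula: what was removed relative to what is left, times 100."""
    return (original_size - new_size) / new_size * 100.0


def remaining_fraction(reduction_pct):
    """Fraction of the original size that remains for a given article-style reduction."""
    return 1.0 / (1.0 + reduction_pct / 100.0)


def percent_of_original_removed(original_size, new_size):
    """Comparison only: the share of the original file that was removed."""
    return (original_size - new_size) / original_size * 100.0


if __name__ == "__main__":
    # The parcel DWG from the example: 3.8 MB before, 1.6 MB after.
    print(round(percent_reduction(3.8, 1.6), 1))            # article's convention
    print(round(percent_of_original_removed(3.8, 1.6), 1))  # share of original removed
    print(remaining_fraction(300.0), remaining_fraction(200.0))
```

For the parcel DWG, the article's formula gives 137.5%, which corresponds to removing a little under 58% of the original file.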

Table 1: Sizes and percent reduction of example files before and after Curvefitter optimization.

From a visual standpoint the curves in the output datasets will remain curves, no matter how much the viewer "zooms in." Further, when the parcel map is printed on paper, the parcel's curved edges will appear smooth.

Contour lines create notoriously large CAD and GIS files. Depending on their source, they can also have odd "spikes" and "points" that rarely reflect the real world's topography. See, for example, the bottom-most contour in Figure 4a. This contour line data set originated as a 35.9 MB DWG which spawned, using FME, a 50.8 MB SDF3, a 126.7 MB E00 file, a 56.4 MB personal geodatabase, and a 17.3 MB file geodatabase. After Curvefitter processing (Figure 4b), the DWG dropped 360% to 7.8 MB, the SDF3 260% to 14.0 MB, and the personal geodatabase 139% to 23.6 MB. The file geodatabase shrank 65% to 10.5 MB. In this case a more moderate precision of 1.0 foot was used, allowing the final vertices to stray up to a full foot from the original linework.

Figure 4 a) contour lines before Curvefitter.

Figure 4 b) contour lines after Curvefitter.

Other large datasets GIS professionals work with are regional or countrywide geology maps. These define the areas of different types of surface or subsurface formations and often have many complexly organized polygons that are ultimately rendered in multicolor thematic maps. The Natural Resources Canada website makes datasets for the country publicly available for download. FME processed the 46.3 MB shapefile (Figure 5a) to create a 28.5 MB DWG, a 47.8 MB SDF3, a 59.6 MB ESRI personal geodatabase and a 43 MB file geodatabase. Here the parameters were adjusted to focus on smoothness and accuracy, and precision was relaxed to 20 feet. The AutoCAD file dropped by 391% to 5.8 MB, the SDF3 265% to 13.1 MB, the personal geodatabase 242% to 17.4 MB and the file geodatabase 378% to 9.0 MB. The integrity of the many short segments in the original data (Figure 5b) was maintained, as was the topology, in the Curvefitter output (Figure 5c).

Figure 5 a) Raw geologic data from Natural Resources Canada.

Figure 5 b) detail of the linework.

Figure 5 c) optimized linework after Curvefitter; note that a vertex at an intersection was maintained during optimization.

Timing is Everything
Curvefitter comes at the right time for those involved with geospatial data. In the early days of GIS, data collection was the key focus. Today data are abundant, though they bear the legacy of being "well processed." Many of these datasets can be optimized, trimmed down and smoothed out, and as a result become more accurate. Once that's complete, they can be stored in a variety of geospatial data formats that support geometric curves, something not widely available in the past two decades.

It's worth noting from these examples that optimizing shapefiles into ESRI's new file geodatabase yielded impressive results: size reductions on the order of 6:1 using Curvefitter. Further, of all the formats tested, the file geodatabase is the most efficient in terms of file size. (There are other benefits of this format including a 1 TB size limit and cross operating system support.) No matter what the format, smaller optimized files will most definitely play a key role in today's new world of data sharing. Smaller GIS data files, whether purchased online, downloaded free from governments, or sent to mobile devices are far easier to distribute than larger ones. Curvefitter may be the best thing to come along in geospatial data optimization in a long time.