Vector Steganography

A practical introduction

William A.Huber, Ph.D.
Quantitative Decisions
Merion Station, PA

See also GIS & Steganography Part 1 & Part 2 by Bill Thoen.

Introduction
Purpose
Applications
Jittering
Embedding
The Problem
Communications theory
The Challenges
More about embedding
The remaining difficulties
Conclusions

Sidebars

Jittering: For those who want to know more about how jittering is done.

Huffman Codes: An elegant way to compress messages.

Error Correcting Codes: Adding redundancy to messages so that random errors and bursts of errors in transmission can be corrected.

Disguising a Message: Making a message appear random without ruining it.

Bill Thoen writes that steganography is any "technique of hiding secret messages within other media, so that to all eyes but those of the sender and intended receiver, no secret message appears to be present."

Introduction

There are plenty of ways to hide messages within images. This is relatively easy because an image, being an array of pixels, typically contains an enormous amount of redundant information. If you have ever compressed a Windows bitmap (.bmp) file, for instance, you will have noticed that 90 to 99% of the file disappeared. You can hide information in that extra space.

A bitmap and other common image files are in raster formats: arrays of colored dots. This article focuses on the more challenging problem of reliably hiding messages within vector data. Vector formats are used extensively by Geographic Information Systems. They use sequences of coordinates to represent points, lines, and regional boundaries on a map.

Why worry about the vector format when hiding messages in a raster format is so easy? One application is protecting intellectual property. When you have worked hard to produce GIS data, once it is released in an electronic format, it's easy for somebody to copy it and re-use it without your permission (or compensating you, either). By "watermarking" vector data with a hidden copyright message or signature you stand a chance of detecting and demonstrating commercial re-use of your data.

Purpose

Certain vector steganography techniques, as an incidental byproduct, will also support change detection, data compression, integrity checking, and integrated metadata. Change detection lets you know when a feature has been modified. Data compression reduces the space needed to store information. Integrity checking verifies that the data are correctly stored and transmitted. Metadata is information about the data, such as how it was developed, processed, projected, and so on. I find the idea of hiding key parts of the metadata, such as accuracy and projection information, in the figures themselves to be pretty cool.

Applications

Thoen describes several vector steganography techniques. This article explores two of them, which I will call "jittering" and "embedding." (This is a new subject, so I am not aware of any standard terminology and have to make it up as I go.)

Jittering consists of making tiny changes in the vector coordinates. These can be detected if all coordinates are rounded off in advance. For instance, if you encounter a sequence of numbers like 3.142, 2.783, -1.000, and then suddenly see 5.9265358979324, it is obvious something is going on. The information is contained in the extra digits. Because those digits have low numerical significance--in the example they would not change any single value by more than 0.001--their introduction does not alter the number in any material way. Usually. (We will get to the exceptions and problems later.)

Jittering

(Follow the link for detailed technical information and source code listings.)

Close-up of the central Mexican states

Strongly jittered versions, showing how the vertices move

Normally, jittering does not move vertices this much. The motion has been greatly exaggerated to make this illustration.

A full-featured implementation would respect the topological relationships among these polygons, so that the gaps and overlaps would not occur. The vertices would still move, however, just as shown here.

Embedding is a technique I developed to overcome some of the obvious problems with jittering. Embedding consists of using the distances between points to convey information. Usually it is accomplished by adding extra points to the description of a vector figure. Since the points lie on the figure itself, they do not change how it looks: they only change its internal representation. The inter-point distances can transmit messages in many ways. I discuss the choice of a good method below.

Embedding

In GIS applications, the greatest challenge for vector steganography is coping with the effects of routine operations. To relate vector data to real-world features, a GIS must "georeference" its figures. Georeferencing operations include moving the figures around, rotating them, changing their scale, projecting them (from the earth's surface to a flat map), and "unprojecting" back again. A GIS may also modify figures to make them topologically consistent and "clean." For instance, the orientation of vertices might be reversed. Often this is done automatically. All these operations usually introduce huge changes in coordinates, thereby destroying any information contained in their least significant digits.

The Problem

To cope with this, we need ideas from communications theory. From this point of view, steganography is just another "channel" used to communicate a source message to its destination. Any source message can be viewed as a sequence of letters from an alphabet. On modern digital computers each sequence ultimately is a binary string of zeros and ones: the simplest possible alphabet, with two letters only. As these letters are sent over the channel, they can be unpredictably changed. Therefore something is needed to encode the source message before it is sent and to decode it once it reaches its destination, in a way that is relatively immune to changes made while in the channel.

(More about error-correcting codes appears at the Error Correcting Codes page.)

Communications theory

Claude Shannon developed this branch of mathematics in 1948. Its practical applications were quickly recognized.

For more information, see Robert McEliece, The Theory of Information and Coding, 1977. Addison-Wesley, Reading, Mass.

We can expect to pay a price for the ability to detect and correct channel errors: the encoded source has to be redundant, so it will expand in size. This is usually not a problem with raster steganography, because a raster image is such a fat channel. The space for stuffing information into a vector channel is so limited, however, that we must anticipate the need to compress the source. Fortunately, many good source compression techniques are known and readily available.

My ArcView software provides Huffman compression simply to show it can be done within the GIS. Commercial products, using more sophisticated techniques, will do a better job of compressing messages.

Chances are, a vector channel will be public: in many of the applications, anybody could have access to our GIS data. We will want to make our message secure, so that it cannot be understood even if it is detected in the channel, and we might want to authenticate it, so that the intended receiver can verify it came from us. Again, many good techniques for security (encryption) and authentication (digital signatures) are known and readily available.

Rivest, Shamir, and Adleman published their seminal paper on this subject in the Communications of the ACM in 1977. See http://www.rsasecurity
.com/rsalabs/faq/ for more information.

From now on, therefore, we may just as well assume the source message has been digitally signed and encrypted and that the received message might have to be decrypted.

The remaining challenges are unique to the vector channel. First, since this is supposed to be hidden communication, it would be nice to disguise the changes made, however small they might be. Making them look random would work. At first blush, this is not a problem, because any well-encrypted source is already going to look very random, and it will stay random even when converted to binary. However, the randomness will be lost by the error-correction encoder. The process of making a message redundant introduces a lot of structure--evident patterns--into the signal. This might be easily detectable by software that probes vector data for possible hidden messages. Therefore a special randomization step is needed just before the signal enters the channel (that is, just before it is written into the vector figures). For details please see Randomizing a Message.

The Challenges

I accomplish the "special randomization" by masking the message bits with a deterministic sequence of pseudo-random numbers using a decent generator. The same procedure conveniently serves to reverse the process when reading the message.

Second, and this is the most difficult part, is that the kinds of changes that can happen to a message embedded in a vector figure are different than those that communication theory knows how to cope with. Most communications channels are pipes of some sort-- telephone wires, network connections, radio signals, and so on--that convey the message bits. Along their way, these peripatetic bits encounter monsters of the ether, such as noise, cosmic rays, and so on, which prey on electric signals, sometimes changing them or even obliterating them beyond recognition. Usually, these changes occur pretty randomly, or at worst in random bursts.

On the other hand, a vector figure might be moved, reprojected, get sliced up, merged with another figure, reversed in orientation, or even have a few vertices added or removed. You might argue that by then the figure is hardly worth protecting. Well, maybe. If you try to read a message out of a figure and find it obliterated, at least you have detected a change. But in other applications, such as protecting intellectual property, you probably don't want relatively minor or routine alterations to be considered "new works." If you have accurately surveyed a pipeline or road map, for instance, you should be able to protect your rights to its shape, and guarantee its accuracy, regardless of what projection it is mapped in. Moreover, it would be nice to protect even individual portions of the line.

Map-makers purposely introduce tiny errors, such as "trap streets," into hardcopy maps to obtain copyright protection. Vector steganography can serve to make invisible "trap streets" with the same effect..

In such applications, jittering just won't cut it. This is what embedding was invented for.

More about embedding

The key to embedding is to consider what remains after you move, project, and otherwise slice and dice a figure. Although the coordinates change, and distances, angles, and areas change, the shape approximately remains the same.

Here's a fanciful way to embed a message within a vector figure, the "handwriting" technique:

(I have not tried too hard to hide the message, so you can see what's going on, but in principle it could be shrunk to invisibility without trouble.)

Even after unprojecting and reprojecting with a very bad projection, we can recognize the message:

I am not seriously proposing that anybody should actually use this method! The disadvantages listed below essentially rule this out as a practical technique.

These figures show each segment of the outline in a different (random) color so you can see where the vertices are.

The original figure consisted of 112 segments. Embedding the 10-letter message created a figure with 853 segments. The figure's boundary now has some self-intersections, which isn't very good, either.

This handwriting technique, although somewhat silly, has most of the characteristics we would like vector steganography to have:

	It is straightforward to encode messages into the channel and decode them again.
	Messages will survive heavy abuse in the channel, including moving, rotating, rescaling, skewing, reprojecting, and even some random vertex movement, insertion of vertices, and chopping of the figure into not-too-small pieces.
	Messages can be hidden by making them extremely small compared to the rest of the figure.

The disadvantages of handwriting, though, are important:

Jittering has potential self-intersection problems, too: if two vertices are very close, then jittering can twist up the figure. This problem can be handled by jittering only the least significant digits in the coordinates.

	It is inefficient: a large number of new vertices must be introduced to transmit each character. (If this page took a while to load, the previous figure is the reason why!)
	It is easy to detect and obliterate: any program that "cleans" figures by eliminating slivers or self-intersections will accomplish that.
	It ruins the shape by introducing self-intersections; this can be a problem for subsequent geographic analysis carried out in software.

What I mean by "embedding" is any variant of handwriting that overcomes these problems while maintaining the good characteristics.

The approach that appears the most promising either doesn't change a figure's shape at all, or it changes it in a natural, undetectable way. In its most sophisticated form ("splined embedding") it is completely undetectable and, as a side-effect, can even improve the representation of many geographic features.

The simplest form of embedding, as I propose it, is to insert sequences of points along the line segments that form a vector figure. The first point establishes a reference length. The distances to subsequent points will either equal the reference length or exceed its length by some factor (which I call the "strength" of the embedding). For example, if the reference length is four inches and the embedding strength is 1.0, then the factor used is 1 + strength = 2.0 and the lengths will be four inches or eight inches (2.0 times four inches).

These are world inches, not inches on the map!

The lengths will correspond to bits in the signal: a long length for a 1, a short length for a 0. Every message will be preceded by the reference length (encoding a zero) just to get things started.

Decoding an embedded message allows for some "slop" in the relative lengths. This is what protects the message from changes brought about by GIS operations. Let's suppose that somehow you can find the beginning of the message. The first length is interpreted as the starting 0. Subsequently, any large increase in the next segment length encountered is interpreted as a 1 and any large decrease in length is interpreted as a 0. By focusing on increases and decreases, the decoder does not depend on the exact preservation of relative lengths.

I am skipping over an important detail here. It's not so easy to find the beginning of the message! The solution is to bracket the message by a specific sequence of bits that is highly unlikely to appear by accident. These "sentinels" show where the message begins and ends. Between 24 and 48 bits should suffice for each sentinel.

As a tiny example, let's send a message consisting of an ASCII "A". Its binary value is 0100 0001. The embedded message must begin with a zero, so we need to encode the sequence 0 0100 0001. Let's use the longest side of the figure for the embedding, so that we can keep the reference distance reasonably large. (Super-tiny segments in a figure might give away the presence of a message.) Suppose this figure is part of a road map and the longest segment is 110 meters. Using a strength of 1.0, we would choose a reference length of 10 meters and encode the zeros as 10-meter segments and the ones as 20-meter segments. Here's what the embedded message looks like:

The yellow (end) vertices were originally in the figure. The blue ones have been added to transmit the binary signal. They will not be visible when the figure is drawn.

If this figure gets distorted, the message can still be read provided the relative lengths along the message do not change by too much. The higher the strength of the embedding, the more resistant the message becomes to such distortions.

There are plenty of problems with the simple method just shown, but most of them can be overcome.

Problem: It's pretty wasteful to devote one vertex to each bit of the signal: the two coordinates of a vertex require between 64 and 128 bits, depending on the GIS. Solution: We aren't limited to just two lengths. For instance, we could use lengths of one, two, four, eight, 16, 32, 64, and 128 times the reference length to squeeze three bits of information into each new vertex. We can get far more than that by using weaker strengths. For example, a strength of 0.05 would be good enough to resist the effects of most reprojections; using lengths up to 25 times the reference would let us put eight bits of information into each new vertex.

The ArcView software I wrote does not use multiple lengths, and so is quite inefficient. It does allow you to specify the strength.

Problem: Embedding looks like it would be easy to detect, because it creates sequences of vertices with perfect 180 degree angles. Solution: Instead of putting the new vertices along a straight line between two original vertices, put them along a nice smooth curve, such as a spline. This solution would look very much like the result of a common, innocuous GIS operation, "densification," used to improve the appearance of figures.

The ArcView software does not provide a splined solution.

Problem: Embedding might require extremely tiny reference lengths. For instance, a detailed outline of the state of Texas requires over 1000 points, of which the longest segment is only a few kilometers. A moderate-size message might require 1000 or more bits, forcing the reference length down to a few meters or less. Solution: You don't have to limit the embedding to a single side of the original figure. You can turn corners, either by incorporating the original vertices into the embedded signal or by using special lengths to signal when the original vertices should be skipped, if they lie in the wrong position.

The ArcView software does not attempt to go around corners. It limits the embedded message to the longest side of the original figure.

Problem: Any automated procedure for "thinning" or "weeding" unnecessary vertices will obliterate any embedded message. Solution: Splined embedding will prevent the weeding of redundant vertices (those with 180-degree angles). However, an aggressive weeding routine will eliminate various vertices even along splines. There is not much we can do to defend against that, except to note that weeding reduces accuracy and cartographic quality, and so constitutes a fundamental change in the figure. One can do only so much to protect a hidden message!

A robust, practical implementation of vector steganography will have many steps. Here is a guide:

The "encrypt and sign" and "decrypt and authenticate" operations are grayed out to indicate I have not implemented them in the ArcView prototype.

Notice the asymmetry between the writing (top half) and reading (bottom half): to maintain a perfect disguise and to make error-correction effective, the sentinels have to be added before these operations are performed. On the other hand, the sentinels are used to locate the message within the vector figure, which has to happen before any other part of the reading step. How are the sentinels found? By inspecting the figure point by point, undisguising and decoding a small number of bits at a time to see whether they match the sentinel.

This asymmetry is something that writing the prototype taught me. Unfortunately, I figured it out too late to change the design: the prototype adds the sentinels just before embedding or jittering. This is not as good a solution..

There remain some potential difficulties with any kind of vector steganography, whether it is implemented by jittering, embedding, handwriting, or some other means. Let's consider a few of these.

The remaining difficulties

First, steganography is not built in to any GIS. So how does one acquire this capability? The obvious solution is to use the built-in scripting capabilities found in any reasonable GIS. However, scripts usually run slowly--sometimes hundreds of times slower than compiled programs--and steganography is a complex process requiring a lot of computation. This is a legitimate concern, which is why Thoen scripted a jittering solution in MapBasic (for the MapInfo GIS) and I have scripted (in Avenue for the ArcView 3.x GIS) a comprehensive solution including source compression, error-correcting coding, and message encryption, implementing both jittering and a simple form of embedding. Both systems are prototypes and neither is perfect, but they serve as proof of concept and address some of these practical issues. In particular, the Avenue solution is robust enough to go to work now and its execution speed is reasonable: typically it takes less than a second to write or read a 254-character message in a moderately complex feature of several hundred vertices.

The execution time depends on what features you are using. The speed quoted here is for compressing, adding error correction, burst error correction, and randomizing the message.

Second, GIS datasets often get reformatted. Conversion from one format to another can completely destroy a jittered message, as Thoen and I discovered in one early test. Fortunately, an embedded message survives reformatting easily.

Not all conversions are this bad, but some can be.

Third, most GIS datasets contain information about how neighboring regions abut each other or how portions of networks attach to each other. Changing coordinates willy-nilly, as in jittering, can destroy these topological relationships. Although our prototypes have not accounted for this, they readily could be using well-known techniques, such as representing polygons as sequences of arcs shared with neighboring polygons. This would eliminate the gaps and overlaps evident in the jittering illustration above and would make the splined embedding solution truly workable.

There now exists software for MapInfo and ArcView 3.x that will let you hide messages in vector data and read them back again. This has many potential applications, including copyrighting, watermarking, data compression, change detection, authentication, and integrity checking. The algorithms can be adapted for any vector GIS and improved to create a truly robust, flexible, efficient system that provides all these benefits.Click on this link HERE for a dowloadable (100K) ArcView 3.2a project file.The project contains two dialogs and a lot of scripts, nothing else.It is guaranteed virus free.(It is not, unfortunately, guaranteed bug free!) The sidebars to the article above provide a lot of guidance for understanding these scripts.Eighteen of the scripts are test drivers that exercise most of the core routines implemented in the other scripts.Running and studying these would be a good place for the interested reader to begin.The scripts have all been protected (and indemnified) with a GNU copyleft, as available at www.gnu.org/copyleft/gp1.html .(It would be the first readable text you encounter if you open the project file in a word processor.)