Data Quality: Before the Map Is Produced

When producing a map, how much thought is your organization putting into the preparation of data before it is displayed? A dazzling map can easily be presented with totally inaccurate information. Why? Because inherent data quality issues within an organization can mean the difference between meaningful maps and a simple graphic or poster.

The Data Warehousing Institute estimates that data quality problems cost U.S. businesses more than $600 billion a year. Yet most executives are oblivious to the data quality lacerations that are slowly bleeding their companies to death. More injurious than the unnecessary printing, postage, and staffing costs is the slow but steady erosion of an organization's credibility among customers and suppliers, as well as its inability to make sound decisions based on accurate information.

Larry English, a leading authority on data quality issues, writes, "...the business costs of non-quality data, including irrecoverable costs, rework of products and services, workarounds, and lost and missed revenue may be as high as 10 to 25 percent of revenue or total budget of an organization."

What Is Data Quality?
Data Quality Attributes. We have talked about the importance of data quality and its business ramifications, but we still need to define what it is. Data quality is not necessarily data that is devoid of errors. Incorrect data is only one part of the data quality equation.

Most experts take a broader perspective. Larry English says data quality involves "consistently meeting knowledge worker and end-customer expectations." Others say data quality is the fitness or suitability of data to meet business requirements.

In any case, most cite several attributes that collectively characterize the quality of data:

Accuracy: Does the data accurately represent reality or a verifiable source?
Integrity: Is the structure of data and relationships among entities and attributes maintained consistently?
Consistency: Are data elements consistently defined and understood?
Completeness: Is all necessary data present?
Validity: Do data values fall within acceptable ranges defined by the business?
Timeliness: Is data available when needed?
Accessibility: Is the data easily accessible, understandable, and usable?
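
As a rough illustration of how two of these attributes (completeness and validity) might be checked in practice, here is a minimal Python sketch. The field names, the required-field list, and the age range are hypothetical; a real organization would define its own business rules.

```python
# Hypothetical business rules, for illustration only.
REQUIRED_FIELDS = ("name", "address", "zip")   # completeness: must be present
VALID_RANGES = {"age": (0, 120)}               # validity: acceptable numeric range


def check_record(record):
    """Return a list of completeness and validity problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{field} out of range")
    return problems


sample = {"name": "John Doe", "address": "", "zip": "80202", "age": 180}
problems = check_record(sample)
print(problems)
```

Accuracy, integrity, and consistency are harder to test with a single-record rule, because they depend on an external source or on comparisons across records and systems.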

The first five attributes generally pertain to the content and structure of data, and cover a multitude of sins that we most commonly associate with poor quality data: data entry errors, misapplied business rules, duplicate records, and missing or incorrect data values.

Duplication of data is probably the most serious issue for the mapping of data. Duplicate records result in the "stacking" of data points when displayed on a map. Someone who is using this mapped information to make important financial and marketing decisions can make some very unrealistic interpretations. Inaccurate data results in inaccurate targeting, budgeting, staffing and financial projections. If one customer is being represented multiple times in a database and being mapped as such, serious mistakes can be made.
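
One simple way to spot stacking before a map is produced is to count how many records share exactly the same coordinates. The sketch below assumes each record carries hypothetical "lat" and "lon" fields; it only flags candidate stacks and does not decide whether they are true duplicates.

```python
from collections import Counter


def stacked_points(records):
    """Return {(lat, lon): count} for coordinates shared by more than one record."""
    counts = Counter((r["lat"], r["lon"]) for r in records)
    return {coord: n for coord, n in counts.items() if n > 1}


customers = [
    {"name": "John Doe", "lat": 39.7392, "lon": -104.9903},
    {"name": "J. Doe", "lat": 39.7392, "lon": -104.9903},
    {"name": "Jane Roe", "lat": 39.7400, "lon": -104.9850},
]
stacks = stacked_points(customers)
print(stacks)
```

A stack is not always an error (several legitimate customers can share one building), so flagged locations still need review before records are merged.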

But defect-free data is worthless if knowledge workers cannot understand or access the data in a timely manner. The last two attributes above address usability and usefulness, and they are best evaluated by interviewing and surveying the business users of the data.

Defect-Free Data Is Not Required. It is nearly impossible to ensure that all data meet the above criteria 100 percent. In fact, it may not be necessary to attempt this Herculean feat. Data does not need to be perfect. It simply needs to meet the requirements of the people or applications that use it. And different types of workers and applications require different levels of data quality.

Data Quality Solutions
Although not all vendors offer every feature, most data quality tools provide the following standard features:

Data Auditing. Also called data profiling or data discovery, these tools or modules automate source data analysis. They generate statistics about the content of data fields. Typical outputs include counts and frequencies of values in each field; unique values, missing values, maximum and minimum values; and data types and formats. Some of these tools identify dependencies between elements in one or more fields or tables, while others let users drill down from the report to individual records.
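
The kind of per-field statistics described above can be sketched in a few lines of Python. This is an illustrative profile of a single field, not a reproduction of any vendor's tool; the function name and the sample "zip" field are assumptions.

```python
from collections import Counter


def profile_field(rows, field):
    """Summarize one field: counts, missing values, uniques, min/max, types."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        # min/max assume the values in a field are mutually comparable
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "types": Counter(type(v).__name__ for v in present),
        "top_values": Counter(present).most_common(3),
    }


rows = [{"zip": "80202"}, {"zip": "80202"}, {"zip": ""}, {"zip": "80301"}]
report = profile_field(rows, "zip")
print(report)
```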

Parsing. Parsing locates and identifies individual data elements in customer files and separates them into unique fields. For example, parsers identify "floating fields" (data elements that have been inserted into inappropriate fields) and separate them. A parser will transform a field containing "John Doe, age 18" into a first name field ("John"), last name field ("Doe"), and age field ("18"). Most parsers handle standard name and address elements: first name, last name, street address, city, state, and zip code. More sophisticated parsers identify complex name and address elements, such as DBA (doing business as) or FBO (for the benefit of). Newer parsers identify products, email addresses, and so on.
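
The "John Doe, age 18" example can be sketched with a regular expression. This is a deliberately narrow illustration that assumes the exact "First Last, age N" layout; commercial parsers rely on much larger rule sets and reference data.

```python
import re

# Assumed layout: "<first name> <last name>, age <digits>"
PATTERN = re.compile(
    r"^\s*(?P<first>\S+)\s+(?P<last>[^,]+?)\s*,\s*age\s+(?P<age>\d+)\s*$",
    re.IGNORECASE,
)


def parse_name_age(raw):
    """Split a combined field into first name, last name, and age fields."""
    match = PATTERN.match(raw)
    if match is None:
        return None  # layout not recognized; leave the record for manual review
    return {
        "first_name": match.group("first"),
        "last_name": match.group("last"),
        "age": match.group("age"),
    }


parsed = parse_name_age("John Doe, age 18")
print(parsed)
```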

Standardization. Once files have been parsed, the elements are standardized to a common format defined by the customer. For example, the record "John Doe, 19 S. Denver Dr." might be changed to "Mr. John Doe, 19 South Denver Drive." Standardization makes it easier to match records. To facilitate standardization, vendors provide extensive reference libraries, which customers can tailor to their needs. Common libraries include lists of names, nicknames, cardinal and ordinal numbers, cities, states, abbreviations, and spellings.
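
A reference library can be as simple as a lookup table of abbreviations. The sketch below expands the street address from the example above; the tiny ABBREVIATIONS table is a hypothetical stand-in for a vendor's much larger library, and it ignores ambiguous cases such as "St" meaning either "Street" or "Saint".

```python
# Hypothetical, very small reference library of address abbreviations.
ABBREVIATIONS = {
    "n": "North", "s": "South", "e": "East", "w": "West",
    "dr": "Drive", "st": "Street", "ave": "Avenue",
}


def standardize_address(address):
    """Expand known abbreviations so that equivalent addresses look alike."""
    words = []
    # Put a space after each period so "S.Denver" splits into two tokens.
    for token in address.replace(".", ". ").split():
        key = token.rstrip(".,").lower()
        words.append(ABBREVIATIONS.get(key, token.rstrip(".")))
    return " ".join(words)


standardized = standardize_address("19 S. Denver Dr.")
print(standardized)
```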

Verification. Verification authenticates, corrects, standardizes, and augments records against an external standard, most often a database. For example, most companies standardize customer files against the United States Postal Service database.

Matching. Matching identifies records that represent the same individual, company, or entity. Vendors offer multiple matching algorithms and allow users to select which algorithms to use on each field. There are several common algorithms: (1) key-code matching examines the first few characters in one or more fields; (2) soundexing matches words by their pronunciation; (3) fuzzy matching computes a degree of likeness among data elements; and (4) weighted matching lets users indicate which fields should be given more weight.
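
The four approaches can be sketched with the Python standard library. The code below is illustrative only: the Soundex function follows the common American Soundex rules, the fuzzy score uses difflib's SequenceMatcher as one possible likeness measure, and the field names and weights are assumptions rather than any vendor's defaults.

```python
from difflib import SequenceMatcher

_SOUNDEX_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                   ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _SOUNDEX_GROUPS for ch in letters}


def key_code(record, fields, length=3):
    """Key-code matching: join the first few characters of the chosen fields."""
    return "".join(str(record.get(f, ""))[:length].upper() for f in fields)


def soundex(word):
    """Soundexing: encode a word by its approximate English pronunciation."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result, prev = word[0], _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (result + "000")[:4]


def fuzzy(a, b):
    """Fuzzy matching: degree of likeness between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a, rec_b, weights):
    """Weighted matching: per-field fuzzy scores combined by field weight."""
    total = sum(weights.values())
    return sum(w * fuzzy(str(rec_a.get(f, "")), str(rec_b.get(f, "")))
               for f, w in weights.items()) / total


a = {"first": "Jon", "last": "Smith", "zip": "80202"}
b = {"first": "John", "last": "Smyth", "zip": "80202"}
score = weighted_score(a, b, {"last": 3, "first": 1, "zip": 2})
print(key_code(a, ["last", "zip"]), soundex(a["last"]), soundex(b["last"]), round(score, 2))
```

In practice a threshold on the combined score decides whether two records are treated as a match, and borderline pairs are usually routed to manual review.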

Consolidation/Householding. Consolidation combines the elements of matching records into one complete record. Consolidation also is used to identify links between customers, such as individuals who live in the same household, or companies that belong to the same parent. The above capabilities can help an organization boost the accuracy levels of its customer files into the high 90 percent range, especially if these processes are automated to run against both source files and downstream systems on a regular basis.
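
Consolidation and householding can be sketched as follows. The merge rule here (keep the first non-empty value seen for each field) and the household key (address plus zip code) are simplifying assumptions; real tools let users choose survivorship rules and linking criteria.

```python
from collections import defaultdict


def consolidate(records):
    """Merge records already judged to match, keeping the first non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            if value not in (None, "") and merged.get(field) in (None, ""):
                merged[field] = value
    return merged


def households(records, key_fields=("address", "zip")):
    """Group customer names that share the same (assumed) household key."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


matches = [
    {"name": "John Doe", "phone": "", "zip": "80202"},
    {"name": "John Doe", "phone": "555-0100", "zip": ""},
]
golden = consolidate(matches)

people = [
    {"name": "John Doe", "address": "19 South Denver Drive", "zip": "80202"},
    {"name": "Jane Doe", "address": "19 south denver drive", "zip": "80202"},
    {"name": "Jane Roe", "address": "4 Elm Street", "zip": "80301"},
]
linked = households(people)
print(golden, linked)
```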

Taking all of these factors into consideration, it becomes very clear that a dot on a map is not just a dot. It represents the end result of a very important process that has to be performed before the map is produced: the process of data quality.
The best mapping technology in the world is only as good as the data you feed it.