Making Your Data ‘Stuff’ Work For You

By Lisa Flint

Remember George Carlin's classic routine about 'needing a place for his stuff'? Stuff is everywhere, often more than we can manage. This is particularly true in today's information age, where we're bombarded with tips, insight and news through word-of-mouth, email, phone or fax. Information is constantly available, but how valuable is it if we can't gain knowledge, and therefore wisdom, from it? Information delivers no value if it is used improperly or if it is simply wrong.

Information comes from data, and wisdom comes from information. Wisdom about customers, citizens and members is the biggest asset of any organization, institution, association or agency. But it can be achieved only if the information being received and processed is complete and accurate, and is delivered in an intuitive form when it is needed.

It's impossible to quantify the amount of time and money that has been spent over the past 20-plus years on gathering and processing data. "If I had a dollar for every piece of bad data I received ... We'd all be rich!" Quality information can only be achieved through two things: good data and reliable technology. There's data everywhere. Once you have it, you need to store it and to be able to access and accommodate data from other sources. If you think about the six degrees of separation, you have a pretty extensive link to data from all over the world.

While we have figured out how to address some of the logistical challenges of efficiently storing and accessing data, and then proactively asking questions of that data to gain insight, we're still struggling with the quality of that data and with the speed at which we can get reliable answers to the questions we ask of it. After all, we've all heard of "real-time" or "right-time" information.

Data integration technologies such as ETL (extract, transform and load), data quality and geo-coding are keys to helping solve these new problems, as well as improving on existing challenges. ETL involves accessing data, transforming the data so that the metadata is consistent and then placing that data in a selected location. Data quality is the step that ensures your data is clean and accurate, and that you have removed any duplication. Address standardization, as part of the data quality process, lowers the cost and raises the accuracy of mailings, and it sets a basis for geo-coding. Geo-coding enhances data by "coding" it with location-based (or geographic) references such as latitude and longitude values, which can be used to map the location or to obtain demographic data for that location. Geo-coding is also critical in that it provides a link for spatial analysis and visualization.
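To make the three ETL steps concrete, here is a minimal sketch in Python. It is an illustration only, not how SAS's ETL tooling works: the two sample sources, the column names and the `customers` table are all hypothetical, and the "transform" step does nothing more than map differing column names onto one shared schema and tidy the capitalization.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that describe the same kind of record
# but use different column names (inconsistent metadata).
MAIL_CSV = "Name,Addr\nPat Doe,100 N Main St\n"
STORE_CSV = "customer_name,street_address\nPAT DOE,100 north main street\n"

# Assumed mapping from each source's column names onto one shared schema.
COLUMN_MAP = {
    "Name": "name", "customer_name": "name",
    "Addr": "address", "street_address": "address",
}


def extract(text):
    """Extract: read the raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: rename columns to the shared schema and tidy the values."""
    return [
        {COLUMN_MAP[key]: value.strip().title() for key, value in row.items()}
        for row in rows
    ]


def load(conn, rows):
    """Load: place the transformed rows in the selected location."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, address TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:name, :address)", rows)


def run_etl(conn):
    for source in (MAIL_CSV, STORE_CSV):
        load(conn, transform(extract(source)))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection)
    for row in connection.execute("SELECT name, address FROM customers"):
        print(row)
```

Note that the same person still lands in the table twice, once per source, with slightly different addresses. ETL makes the metadata consistent; removing that kind of duplication is the job of the data quality step.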

How can these three technologies make such a big impact on a company's success? A good example is the Ohio Department of Natural Resources (OHDNR), which needs to be able to predict lapses in its hunting and fishing license renewals. OHDNR's Wildlife division was the first to try to predict such customer behavior. OHDNR started the process by using SAS data integration and data mining technologies to create a simple data warehouse where data could be stored and analyzed. Analysts very quickly realized the data they collected was "dirty" and redundant, which skewed the results of the data mining efforts. Because hunting and fishing licenses can be renewed in many ways (through the mail, by phone or through stores selling fishing and gaming supplies), the opportunity for data redundancy was high. There could be multiple instances of a single person with slight variations in name or address.

By using data quality technology from DataFlux, a wholly owned subsidiary of SAS, OHDNR was able to better identify and clean up duplicate and redundant data. DataFlux focuses on developing technology that employs the concept of fuzzy logic, allowing software to identify multiple versions of the same person. OHDNR's cleansing processes include:
  • Address verification/standardization
  • Name standardization
  • De-duplication of customer records
Fuzzy logic utilizes a rules-based engine that uses parsing rules, standardization rules, phonetic matching and token-based weighting to eliminate ambiguity in the source data. After applying hundreds of thousands of rules to every field, the engine returns a "match code," an unambiguous value representing the ambiguous source data. A "sensitivity" setting lets the user choose the desired closeness of the data: the higher the sensitivity, the closer two values need to be to result in a match.
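The idea of a match code can be illustrated with a toy sketch in Python. This is not DataFlux's algorithm, which relies on a far larger rule set; the tiny standardization table, the crude phonetic reduction and the way `sensitivity` maps to a number of retained characters are all assumptions made for the example.

```python
import re

# Hypothetical standardization table; a real rule base holds far more entries.
STANDARD_FORMS = {
    "n": "north", "no": "north",
    "st": "street", "str": "street",
    "ave": "avenue",
}


def phonetic_key(token):
    """Crude phonetic reduction: keep the first character, then drop
    vowels (plus y, h, w) and consecutive repeated letters."""
    if not token or not token[0].isalpha():
        return token  # leave numbers such as "100" untouched
    key = token[0]
    for ch in token[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key += ch
    return key


def match_code(text, sensitivity=85):
    """Return a toy match code for `text`.

    Higher sensitivity keeps more characters per token, so two values
    must be closer to each other to produce the same code.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [STANDARD_FORMS.get(token, token) for token in tokens]
    keep = max(1, sensitivity // 15)  # assumed mapping: 85 -> 5 chars, 100 -> 6
    return "".join(phonetic_key(token)[:keep].upper() for token in tokens)


if __name__ == "__main__":
    print(match_code("100 N Mane St"))
    print(match_code("100 No. Main Street"))
    print(match_code("Robertson", 85), match_code("Roberts", 85))
    print(match_code("Robertson", 100), match_code("Roberts", 100))
```

In this sketch the two spellings of the address reduce to the same code, while "Robertson" and "Roberts" share a code at sensitivity 85 but not at 100, which mirrors the behavior described above: raising the sensitivity demands a closer match.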

Using out-of-the-box address parsing rules, the match engine accurately identifies the meaning of each token within the address string. For example, the engine breaks down the following addresses:

Address
100 N Mane St, Floor 12
100-12 North Main
#12, 100 No. Main Street

Into the following tokens:

Street Number | Pre-Direction | Street Name | Street Type | Address Extension | Address Extension Number
100           | N             | Mane        | St          | Floor             | 12
100           | North         | Main        |             |                   | 12
100           | No            | Main        |             | #                 | 12
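A heavily simplified version of this kind of token parsing can be sketched in Python. The vocabularies and the `parse_address` function below are hypothetical stand-ins for the engine's parsing rules, and the logic handles only simple address shapes like the three samples above.

```python
import re

# Small illustrative vocabularies; hypothetical stand-ins for real parsing rules.
DIRECTIONS = {"n", "no", "north", "s", "so", "south", "e", "east", "w", "west"}
STREET_TYPES = {"st", "street", "ave", "avenue", "rd", "road", "blvd"}
EXTENSIONS = {"#", "floor", "fl", "suite", "ste", "apt", "unit"}


def parse_address(address):
    """Assign each token of a simple address string to a named field."""
    fields = {
        "street_number": "", "pre_direction": "", "street_name": "",
        "street_type": "", "extension": "", "extension_number": "",
    }
    after_extension = False
    # Split into '#', runs of letters and runs of digits; punctuation is dropped.
    for token in re.findall(r"#|[A-Za-z]+|\d+", address):
        low = token.lower()
        if low in EXTENSIONS:
            fields["extension"] = token
            after_extension = True
            continue
        if token.isdigit():
            # A number right after an extension word, or a second number,
            # is treated as the extension number.
            if after_extension or fields["street_number"]:
                fields["extension_number"] = token
            else:
                fields["street_number"] = token
        elif low in DIRECTIONS and not fields["street_name"]:
            fields["pre_direction"] = token
        elif low in STREET_TYPES and fields["street_name"]:
            fields["street_type"] = token
        else:
            fields["street_name"] = (fields["street_name"] + " " + token).strip()
        after_extension = False
    return fields


if __name__ == "__main__":
    for sample in ("100 N Mane St, Floor 12", "100-12 North Main"):
        print(sample, "->", parse_address(sample))
```

Once every record is broken into the same named fields, standardization and matching can work field by field instead of comparing raw strings.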

The underlying rules engine is called the Quality Knowledge Base. This shared metadata repository is the main point of integration into SAS' core language as well as SAS' ETL technology. The core parsing, matching, standardizing and identification functions are embedded in SAS and point to the same Quality Knowledge Base described above. The Quality Knowledge Base includes rules for:
  • Address
  • City
  • City, State/Province, Post Code
  • Email
  • Phone
  • Website URLs
  • Date
  • Name
  • Two names
  • Account number
  • Date time stamps
  • Company names
  • Phone numbers
  • State/Province
  • Post codes

The Quality Knowledge Base is fully customizable, allowing users to create custom parsing, standardization, matching and identification rules, from simple to complex, through an easy-to-use point-and-click interface.

To find matches of people, simply add additional attributes (company name, individual name) so the match engine will find matches based on multiple criteria, as shown in the following match report. The match engine can also use Boolean logic to combine attributes with AND/OR, grouping records by shared attributes and going beyond the identification of individual duplicates. The report below shows the sophistication of the match engine:

[Figure: Duplicate Table (match report)]
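The AND/OR grouping described above can be sketched in a few lines of Python. This is an assumed, simplified model rather than the match engine's actual logic: a rule is a list of attribute groups, two records match if every attribute in at least one group is equal and non-empty, and matching records are merged into clusters. The field names and sample records are hypothetical, and the values are assumed to be already standardized (for example, replaced by match codes).

```python
from itertools import combinations


def cluster(records, rule):
    """Group record indices into clusters.

    `rule` is a list of attribute tuples. Two records match when ANY tuple
    has ALL of its attributes equal and non-empty (an OR of ANDs).
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i], records[j]
        if any(all(a[k] and a[k] == b[k] for k in group) for group in rule):
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


RECORDS = [
    {"name": "PAT DOE", "company": "ACME", "address": "100 N MAIN"},
    {"name": "PAT DOE", "company": "ACME", "address": "12 OAK"},
    {"name": "SAM ROE", "company": "ACME", "address": "100 N MAIN"},
    {"name": "LEE POE", "company": "", "address": "7 ELM"},
]

if __name__ == "__main__":
    # (name AND company) OR address
    print(cluster(RECORDS, [("name", "company"), ("address",)]))
    # address only: a household-style grouping
    print(cluster(RECORDS, [("address",)]))
```

Changing the rule changes what a "duplicate" means: requiring name, company and address together finds only strict duplicates, while matching on address alone groups everyone at the same location.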

An important part of understanding OHDNR's customers is having greater insight into where each customer resides geographically. How far is a customer from bodies of water or designated hunting areas? Are customers frequenting bodies of water near or far from their homes? What would the customer base look like if mapped visually? To gain this knowledge, OHDNR used geo-coding technology, which reads address information and returns new data such as latitude and longitude, census block, census group and census tract data. The latitude and longitude are then read by mapping software to create a map of the customer base. OHDNR uses the census group number to get additional demographic data, such as average household age and income, from the Census Bureau.
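Once each address carries a latitude and longitude, a question like "how far is this customer from the nearest body of water?" becomes simple arithmetic. The Python sketch below uses the standard haversine great-circle formula; the customer location and the two sites are made-up coordinates for illustration, not OHDNR data, and `nearest_site` is a hypothetical helper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(customer, sites):
    """Return (site name, distance in km) for the site closest to the customer."""
    return min(
        ((name, haversine_km(*customer, lat, lon)) for name, (lat, lon) in sites.items()),
        key=lambda pair: pair[1],
    )


# Made-up coordinates used purely for illustration.
CUSTOMER = (40.0, -83.0)
SITES = {"Lake A": (40.0, -82.0), "Lake B": (41.5, -83.0)}

if __name__ == "__main__":
    name, distance = nearest_site(CUSTOMER, SITES)
    print(f"Nearest site: {name}, about {distance:.0f} km away")
```

The same distances, computed for every customer, could feed a map or serve as one more input to a renewal-prediction model.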

Now OHDNR is able to easily collect, clean and enhance its customer data, and to successfully predict, nine times out of ten, when a hunting or fishing license will lapse in the renewal process. In the future, it plans to include data from two more divisions (Parks and Recreation, and Watercraft) and use geo-coding to get an even broader, more in-depth view of its customer base. This additional data could help OHDNR understand, for example, that four fishing licenses in a single neighborhood were not renewed because the only person in the neighborhood with a boat moved.

While we may have more "stuff" than ever before, we're no longer at its mercy. With sophisticated data integration technologies, we can get the most out of all of our data with less effort and make more confident decisions, because we know we have a complete, reliable picture of a situation, such as a customer profile, supplier activity or inventory status. By making our data work for us, successes like those achieved by OHDNR are within our grasp.

Published Thursday, October 27th, 2005





© 2016 Directions Media. All Rights Reserved.