A Primer on How to Create a Customized Segmentation System

What is a segmentation system?

Segmentation systems gather individual objects such as customers (customer segmentation), markets (market segmentation) or neighborhoods (geodemographic segmentation) into groups called segments. A segmentation system is created through the process of clustering, also known as cluster analysis, where similar objects are grouped into homogeneous clusters (segments). These clusters should be as different from each other as possible.

Segmentation systems provide synthetic information about objects by focusing on differences in demography, lifestyles, consumer behavior, etc. Geodemographic segmentation systems are usually based on such variables as age, housing, family status, income, education, ethnicity, mobility, labor force, religion and other census variables. Such systems assume rather homogeneous characteristics for all customers within a given geographical neighborhood. Lifestyle segmentation systems include additional data derived from various surveys on preferences and lifestyle behavior, related to autos, journals/magazines, TV, leisure, food and drinks, credit cards, etc. Customer segmentation systems refer more to individual customers than to geographic neighborhoods and are usually based on proprietary survey data about customers.

Creating segmentation systems

All segmentation systems are created using a similar methodology. They are based on a large number of input variables coming from various data sources. Variables on the final list are known as "diagnostic" variables. Basic steps used in creating segmentation systems include:

Selecting input variables
Reducing their number
Determining the number of segments
Agglomerating objects (customers, markets, geographic neighborhoods) into segments
Providing summary information about each segment, naming segments and determining which object is the most typical for each segment
Mapping segments and linking them to other data sets.

Statistical software packages such as SPSS, SAS, S-PLUS or data mining software such as Clementine are required to perform some of the steps listed above, especially to reduce the number of variables, measure similarity between objects, and agglomerate them into segments.

Why use customized segmentation systems?

There are many commercially available segmentation systems. They have been created in many countries for different levels of geographic aggregation and using a variety of data sets. The first segmentation systems were created in the early 1970s. The emergence of GIS technology made the geographic aspect dominant in such systems. Recent trends in segmentation systems focus more on specific target groups (micromarketing).

Examples of the most common commercial segmentation systems available today are MOSAIC (available for more than 20 countries and targeting more than 800 million customers), PRIZM, PSYTE, ACORN and many others. These systems were created by companies with experts from marketing, demography, geography, statistical analysis and data mining. Commercial segmentation systems are often based on data sets that are not publicly available. They represent the state of the art and are updated regularly, usually once a year or even every six months.

However, they have some serious drawbacks. First of all, these systems are expensive, costing thousands of dollars. For some small businesses this factor can be a real constraint. Second, all of them are general purpose systems, not oriented toward individual businesses or particular smaller geographic neighborhoods. These systems can serve any type of business located within any neighborhood, but business owners might be particularly interested in using specific types of input variables, including their own databases about customers. So in some cases, it might make more sense to create a customized segmentation system. Customized systems are not just low budget alternatives to commercial segmentation systems. Both types of systems can complement each other.

Access to a statistical package, some expertise in using statistical methods and a GIS are necessary for creating a customized segmentation system. These obstacles can be overcome by outsourcing some tasks to professional consultants.

Diagnostic variables and their properties

The diagnostic variables should be:

Relevant
Geographically differentiated
Not interrelated

Input variables should be relevant to the given type of business. They should also be available at the desired geographic level of aggregation and have some other critical properties. For example, for a system customized to banking services, potential diagnostic variables could include: daytime population, disposable and discretionary income, distance from the centroid of a given geographic polygon to the closest bank or banking machine expressed in driving time, and consumer spending potential on such categories as government retirement and pension funds, income tax, tax refunds, employment insurance premiums, gifts to charitable organizations, purchase of automobiles and trucks, property taxes, mortgage interest, etc.

In addition to being relevant, the diagnostic variables should be available at the desired geographical level of aggregation. All these variables are available in Canada, for example, at the aggregation level of 200-300 households. Commercial segmentation systems exist for census block groups (US PRIZM and US MOSAIC), postal codes (MOSAIC Canada), dissemination/enumeration areas (PSYTE Canada), and enumeration districts (ACORN). There are systems referring to a single street (in Belgium and Luxembourg) or even a single building (in Germany).

Diagnostic variables should also be geographically differentiated and relatively stable in time. For example, the ratio between males and females is usually not highly spatially differentiated and therefore is not a good diagnostic variable. The coefficient of variation can be used as a measure of geographic differentiation. This coefficient is the ratio of the standard deviation to the arithmetic mean. The following table shows arithmetic means and standard deviations for seven diagnostic variables. Figure 1 presents the corresponding coefficients of variation. In this example, the best diagnostic variables are the density of population and household income. The least geographically differentiated variables are urbanization and literacy, and they would be the least recommended as diagnostic variables.

Table 1. Descriptive statistics

Figure 1. Coefficients of variation
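
The coefficient of variation described above can be computed with a few lines of code. The following is a minimal sketch using only the Python standard library. The variable names and the sample values are hypothetical and are not taken from Table 1. The population standard deviation is assumed here, since the article does not say which variant was used.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the ratio of the standard deviation to the arithmetic mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Assumption: population standard deviation (pstdev), not the sample one.
    return pstdev(values) / m


# Hypothetical values for five geographic units (not the article's data).
sex_ratio = [0.98, 1.01, 1.00, 0.99, 1.02]
population_density = [120.0, 4500.0, 35.0, 900.0, 2600.0]

cv_sex_ratio = coefficient_of_variation(sex_ratio)
cv_density = coefficient_of_variation(population_density)

print(f"sex ratio CV: {cv_sex_ratio:.3f}")
print(f"population density CV: {cv_density:.3f}")
```

In this made-up sample the sex ratio barely varies between units, so its coefficient is close to zero, while population density has a coefficient above one and would be the stronger candidate.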

How many diagnostic variables should be used?

There are thousands of initial input variables used in commercially available segmentation systems. However, the final list of diagnostic variables is much smaller and varies between 50 (PIN system by Pinpoint Analysis Ltd.) and 120 (Super Profiles system by CDMS Ltd.). Usually, census variables constitute from 50% to 100% of the total number of diagnostic variables in these systems. The number of initial and final diagnostic variables in customized segmentation systems should be within a similar range or even smaller than for general purpose systems.

How to reduce the number of diagnostic variables?

The initial list can include hundreds or even thousands of potential input variables. Using too many variables can lead to what is known in statistics as "overfitting the model," and this situation should be avoided. If two variables are highly (significantly) correlated, their explanatory value is very similar and using both of them is redundant. The following table presents the correlation matrix between every pair of seven diagnostic variables. These coefficients have values from +1 (perfect positive correlation), through 0 (no correlation) to -1 (perfect negative correlation). Very high correlation is marked with ** and high correlation is marked with *. For example, density of population is positively correlated with two variables only, whereas urbanization is highly correlated with every other variable and very highly correlated with five of six variables. This variable (urbanization) can be eliminated and another correlated variable can be used instead.

Table 2. Correlation matrix
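
The idea of dropping one variable from each highly correlated pair can be sketched in code. The example below is only an illustration. The function names, the 0.9 threshold, the greedy keep-the-first-variable order and the sample values are all assumptions, not the procedure or the data behind Table 2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def drop_correlated(variables, threshold=0.9):
    """Greedy filter: keep a variable only if its absolute correlation with
    every variable kept so far stays below the (assumed) threshold."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical values for six geographic units.
variables = {
    "household_income": [30, 45, 52, 61, 75, 90],
    "urbanization": [31, 44, 55, 60, 77, 88],  # almost linear in income
    "death_rate": [9, 6, 11, 7, 10, 8],
}

kept = drop_correlated(variables)
print(kept)
```

Because the filter is greedy, the order in which variables are listed decides which member of a correlated pair survives. In practice that choice would be guided by relevance to the business rather than by position in the list.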

Analyzing the correlation matrix is a complex and tedious process, especially when dealing with hundreds of variables. Additionally, so-called partial correlation should be analyzed to see how relationships between variables change if one variable is removed. Some other statistical methods such as multiple linear regression or factor analysis / principal component analysis can also be useful for eliminating variables. However, some expertise in using multivariate statistical methods is necessary. Tools needed for these tasks are commonly available in statistical software packages, which provide fully automated procedures.

Multiple linear regression analysis can help to select the set of variables (predictors) that are the most suitable for predicting another single variable. However, the major disadvantage of this method is that one variable must be identified as a target variable. The following table was created using multiple regression analysis. This table shows that four of seven variables were selected for predicting another variable (infant mortality). Variables not selected are: density of population, urbanization level and household income. Should they be eliminated? Maybe, if the target variable (infant mortality) is really critical for a given segmentation system.

Table 3. The list of selected variables

Principal component analysis replaces individual variables with components that are common underlying dimensions for these variables. The following table shows how principal component analysis can help in eliminating variables. In this table, the rows are variables, columns are components and the table entries are coefficients of correlation between variables and components. For example, the first component can be used instead of four variables: population increase, birth rate, literacy, and household income. The second component can replace two other variables: death rate and urbanization. The third component corresponds to just one variable.

Table 4. Variables and corresponding components

Components 1 and 2 together account for 70% of the total variance explained (see the table below). That means that instead of six variables, two components can be used as diagnostic variables and the number of input variables will be reduced.

Table 5. The explanatory power of components
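
The share of variance explained by each component is its eigenvalue divided by the sum of all eigenvalues. The sketch below shows how a cumulative share can be used to decide how many components to keep. The eigenvalues are hypothetical, chosen only so that the first two components reach the 70% mentioned above; they are not the values from Table 5, and the function name is an assumption.

```python
def components_needed(eigenvalues, target_share):
    """Return how many of the largest components are needed to reach the
    target share of total variance, plus the cumulative share reached."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        # Small tolerance guards against floating-point rounding.
        if cumulative / total >= target_share - 1e-9:
            return count, cumulative / total
    return len(eigenvalues), 1.0


# Hypothetical eigenvalues for six standardized variables (they sum to 6).
eigenvalues = [2.7, 1.5, 0.9, 0.5, 0.3, 0.1]

n_components, share = components_needed(eigenvalues, target_share=0.70)
print(n_components, round(share, 2))
```

The 70% target is a judgment call rather than a fixed rule; a stricter target would simply keep more components.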

How many segments should be created?

When the final set of diagnostic variables is ready, the number of segments should be determined. The number of segments in commercial systems varies from 25 (PIN system in UK) to more than 150 (MOSAIC Canada, Super Profiles UK). In addition, segments are often agglomerated into groups. Various commercially available systems have from 10 to 40 groups of segments. Sturges' rule can be used as a rule of thumb for determining the recommended number of segments:

Number of segments = 1 + 3.3 log10(number of objects)

For example, for 1,500 geographical units, the rule gives a value between 11 and 12 segments:

1 + 3.3 log10(1,500) = 1 + 3.3(3.18) = 1 + 10.5 = 11.5
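
The same rule of thumb can be written as a small function. This is a minimal sketch; the function name is an assumption, and the logarithm is taken to base 10, which is what the worked example above implies.

```python
from math import log10


def sturges_segments(n_objects):
    """Rule-of-thumb number of segments: 1 + 3.3 * log10(number of objects)."""
    if n_objects < 1:
        raise ValueError("the number of objects must be at least 1")
    return 1 + 3.3 * log10(n_objects)


estimate = sturges_segments(1500)
print(round(estimate, 1))
```

The result is not a whole number, so it is usually read as a range (here 11 to 12 segments) rather than as an exact answer.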

Clustering algorithms

Statistical and data mining software packages provide some common clustering algorithms such as hierarchical clustering, K-means clustering, two-step clustering or Kohonen networks. In hierarchical clustering, for example, the first step is to determine similarity between objects. Some similarity measures can be applied even to categorical (not numerical) data. As a result, a proximity matrix can be determined showing how similar any two objects are. However, the size of such a matrix often makes it impossible to display it. For example, there are more than 1.5 million objects included in the UK MOSAIC system and the proximity matrix would have the size of 1.5 million times 1.5 million entries!
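
To make the proximity matrix and its size concrete, here is a small sketch. Euclidean distance is assumed as the (dis)similarity measure, since the article does not name one, and the three sample objects are hypothetical. `math.dist` requires Python 3.8 or later.

```python
from math import dist


def proximity_matrix(objects):
    """Square matrix of Euclidean distances between every pair of objects."""
    return [[dist(a, b) for b in objects] for a in objects]


# Three hypothetical objects described by two numeric variables each.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = proximity_matrix(objects)
for row in matrix:
    print(row)

# Size of the matrix for 1.5 million objects, as in the UK MOSAIC example.
n_objects = 1_500_000
n_entries = n_objects ** 2
n_unique_pairs = n_objects * (n_objects - 1) // 2  # matrix is symmetric
print(f"{n_entries:,} entries, {n_unique_pairs:,} unique pairs")
```

Even when only the unique pairs are stored, the count is above one trillion, which is why a full proximity matrix is impractical at that scale.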

The agglomeration process creates segments from similar objects and can be done in many ways. Commonly used statistical software packages provide numerous (5-10) agglomeration methods.

The alternative approach to hierarchical clustering is the K-means clustering algorithm. In K-means clustering, the proximity matrix is not created, similarities between objects are not determined, and objects are not agglomerated based on these similarities. Instead, the algorithm reassigns objects to segments, and some objects can change their assignment from one segment to another. Each object is assigned to the segment for which its distance to the cluster mean is minimal. It is recommended that the number of objects per segment should be similar. Otherwise, some unique segments will contain just a few objects or even only one object. Procedures similar to those used for creating segments are also used for creating groups of segments.
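
The reassignment loop behind K-means can be sketched in a few lines. The example below is a bare-bones version of the standard iteration (assign each object to the nearest mean, then recompute the means), not the implementation used by any particular statistical package. The function name, the fixed starting centroids and the sample points are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def k_means(points, initial_centroids, max_iterations=100):
    """Assign each point to the nearest centroid, update the centroids,
    and repeat until no point changes its segment."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed its segment
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty segment keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# Six hypothetical objects described by two (already comparable) variables.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])

print(labels)
print(centers)
```

With real diagnostic variables measured in different units, the values would normally be standardized first so that no single variable dominates the distance.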

The most typical representatives

A byproduct of K-means clustering is the distance from each object to the centroid of each segment. This distance can be used to determine the most typical object, such as a census block group, enumeration district, postal code, etc. This object can be identified on a map, for example, by its ZIP or postal code, and corresponding images can be obtained and used to illustrate the nature of this segment. The following table presents a portion of the output data sorted first by the segment ID and then by the distance to the segment centroid. For example, there are fifteen objects belonging to segment #5. The minimal distance to the centroid of this segment is 128 and the most typical object for this segment has ID 25.

Table 6. Segment membership and the distance to a centroid
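
Sorting by segment ID and then by distance, as in Table 6, can be sketched as follows. The record layout is an assumption. One record echoes the example quoted above (object 25 in segment #5 at distance 128); all other values are hypothetical.

```python
# Each record: (object_id, segment_id, distance_to_segment_centroid).
records = [
    (31, 5, 342.0),
    (25, 5, 128.0),  # matches the example quoted in the text
    (12, 5, 207.5),
    (7, 2, 95.0),    # hypothetical
    (44, 2, 61.3),   # hypothetical
]

# Sort first by segment ID, then by distance to the centroid.
sorted_records = sorted(records, key=lambda r: (r[1], r[2]))

# The first record seen for each segment is its most typical object.
most_typical = {}
for object_id, segment_id, distance in sorted_records:
    most_typical.setdefault(segment_id, (object_id, distance))

for segment_id, (object_id, distance) in most_typical.items():
    print(f"segment {segment_id}: object {object_id} (distance {distance})")
```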

Describing segments

A concise description should be provided for every segment based on such statistics of diagnostic variables as arithmetic mean, minimum, maximum, standard deviation and count. These statistics can be obtained using not only statistical software packages but also spreadsheets, GISs, etc. These tools perform such operations as "group by..." or "agglomerate by..."

OLAP cubes, also known as pivot tables (in spreadsheets), represent the most suitable tool for such a task. The following table contains basic statistics for seven hypothetical diagnostic variables and for seven segments. Maximum values in this table are marked in red, whereas the minimum values are marked in blue. Based on the information from this table, segment #4 can be characterized as:

...having the highest mean values for the density of population (the arithmetic mean is 4975.0, no value is less than 4456.0 and the maximum density reaches 5494.0) and the level of urbanization (the arithmetic mean is 97.0, with no value less than 94.0 and the maximum reaching 100.0). This segment can also be characterized as having both the lowest birth rate (14.5) and death rate (6.0) per 1000 people. In addition, this segment has the most differentiated values of the density of population (standard deviation of 734.0) and the most homogeneous values for urbanization (standard deviation of 4.2), birth rate (2.1) and death rate (0.0). The last number indicates that all objects belonging to this segment have an identical death rate (6.0)...

Other segments can be described in a similar way using information provided in the summary table. Segment #7 does not have any extreme characteristics. However, this segment can be described by comparing its values with the national averages provided in the column Total.

Table 7. Basic statistics describing segments
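
A "group by" summary of this kind can be sketched with the Python standard library. The two density values for segment #4 are an assumption chosen to be consistent with the statistics quoted above (mean 4975.0, minimum 4456.0, maximum 5494.0, sample standard deviation 734.0). The rows for segment #7 are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Each row: (segment_id, density_of_population).
rows = [
    (4, 4456.0),
    (4, 5494.0),
    (7, 210.0),  # hypothetical
    (7, 340.0),  # hypothetical
    (7, 275.0),  # hypothetical
]

# Group the values by segment ID.
by_segment = defaultdict(list)
for segment_id, density in rows:
    by_segment[segment_id].append(density)

# Compute count, mean, minimum, maximum and sample standard deviation.
summary = {
    segment_id: {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values) if len(values) > 1 else 0.0,
    }
    for segment_id, values in by_segment.items()
}

for segment_id, stats in sorted(summary.items()):
    print(segment_id, stats)
```

The same grouping can be repeated for each diagnostic variable, which is what a pivot table does in one step.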

Besides these characteristics, some additional geographic information can be provided. For example, the short description of the Asian MOSAIC segment (one of sixty segments from the Canadian PSYTE segmentation system) is:

"Inner city areas in which Chinese and other Asians are concentrated. In Toronto and Vancouver these areas are known as "Chinatown." Dwellings are older, a mixture of owned, single detached, rented semis and low-rises. Many families have children and typically these are teenagers often at university."

This description was created using census variables such as population by ethnic origin, dwelling characteristics, family type, age structure and education. In addition, geographic variables were used referring to the location of this segment across Canada (Toronto, Vancouver) and within urban areas (inner city areas).

Naming segments

Giving names to segments, after describing each of them, is a relatively easy task. In fact, this can involve some fun as well, since segment names are more informal than informative. The following are examples of names from the UK MOSAIC system: Family in the Sky, Mid Rose Overspill, Rejuvenated Terraces, or Graffited Ghettos.

Mapping segments

Mapping customized segmentation systems is not a difficult task. A GIS can be an asset. Basically, two data sets are required for mapping: a boundary file and an attribute table. The boundary file should contain polygons representing geographical aggregation units such as counties, census block groups, enumeration areas, enumeration districts, dissemination areas, census divisions, etc. The attribute table should contain at least two columns: the ID for every geographical unit and the segment ID. When these two tables are linked and segments are mapped, the map legend should contain the segment name and its short description rather than an ID. The newly created segmentation system can be linked to other data sets through the key item, which is usually the ID for geographic units. The map below presents a portion of a student assignment on creating customized segmentation systems for the Statistical Methods course, GIS for Business Advanced Diploma program at the Centre of Geographic Sciences in Lawrencetown, Nova Scotia, Canada.

I will be very pleased to respond to any question regarding problems you might be having with creating your own customized segmentation systems.

Figure 2. Map of segments
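
The linking step described in the mapping section, where the attribute table is joined to segment names and to another data set through the geographic unit ID, can be sketched without a GIS. All IDs, names and counts below are hypothetical placeholders, and the field names are assumptions.

```python
# Attribute table: one row per geographic unit (hypothetical IDs).
attribute_table = [
    {"unit_id": "DA-001", "segment_id": 1},
    {"unit_id": "DA-002", "segment_id": 2},
    {"unit_id": "DA-003", "segment_id": 1},
]

# Segment names for the map legend (placeholders, not real segment names).
segment_names = {1: "Segment name A", 2: "Segment name B"}

# Another data set keyed by the same geographic unit ID (hypothetical).
customer_counts = {"DA-001": 120, "DA-002": 45, "DA-003": 310}

# Join everything through the segment ID and the geographic unit ID.
linked = [
    {
        **row,
        "legend_label": segment_names[row["segment_id"]],
        "customers": customer_counts.get(row["unit_id"]),
    }
    for row in attribute_table
]

for row in linked:
    print(row)
```

In a GIS the same join would be made between the attribute table and the boundary file, with the unit ID as the key.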