Data Industry Update Appendix B: How PSYTE was developed

PSYTE - A geodemographic classification of U.S. neighborhoods
How PSYTE was developed: The Basics
Internal release notes by R. Bruce Carroll
Sr. VP, The Polk Company - Data Engineering/Marketing Technologies Group

There are many different approaches to multivariate statistical clustering. We selected a customized version of non-hierarchical cluster analysis, known variously as "iterative centroidal relocation" or "K-means clustering". This approach adjusts, in multi-dimensional space, the definition of a fixed number of clusters until a criterion involving "sums of squared distances" is minimized. Put more simply, the computer tests a number of different classifications and searches for a set of clusters that maximizes the similarity of all the geographic units assigned to the same cluster and, at the same time, maximizes the statistical distance or differences between individual clusters.
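The relocation loop described above can be sketched in a few lines of Python. This is plain textbook K-means, not the customized procedure used for PSYTE; the function names, the toy data and the starting centroids are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], start: List[Point],
            max_iter: int = 100) -> Tuple[List[int], List[List[float]], float]:
    """Relocate a fixed number of centroids until assignments stop changing."""
    centroids = [list(c) for c in start]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each unit joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no unit changed cluster, so the solution is stable
        labels = new_labels
        # Relocation step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    # The criterion being minimized: total within-cluster sum of squares.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Toy data (invented): two obvious groups of three units each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
start = [(0.0, 0.0), (10.0, 10.0)]
labels, centroids, sse = k_means(points, start)
print(labels, round(sse, 4))
```

A lower sum of squared distances means tighter clusters. Like any K-means run, the result depends on the starting centroids.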

The success of the whole process depends on four important issues: the selection of the optimal unit of geography for cluster analysis; the set of variables that is used; the influence (or weight) given to each individual variable; and the way in which the "distances" between each geographic unit and each cluster centroid are calculated or measured.
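To see why the weights matter, consider a weighted squared Euclidean distance. The distance measure actually used for PSYTE is not described in these notes, so the formula, the function name and the numbers below are assumptions chosen only to show the effect.

```python
from typing import Sequence


def weighted_squared_distance(unit: Sequence[float],
                              centroid: Sequence[float],
                              weights: Sequence[float]) -> float:
    """Sum of weight * squared difference over all variables."""
    return sum(w * (u - c) ** 2 for u, c, w in zip(unit, centroid, weights))


# Hypothetical standardized profile of one geographic unit on two variables.
unit = (1.0, 1.0)
centroid_a = (0.0, 1.0)   # differs from the unit on variable 1 only
centroid_b = (1.0, 2.5)   # differs from the unit on variable 2 only

equal_weights = (1.0, 1.0)
heavy_first = (4.0, 1.0)  # variable 1 counts four times as much

for weights in (equal_weights, heavy_first):
    d_a = weighted_squared_distance(unit, centroid_a, weights)
    d_b = weighted_squared_distance(unit, centroid_b, weights)
    print(weights, "nearest:", "A" if d_a < d_b else "B")
```

With equal weights the unit is nearest to centroid A. Once the first variable is weighted more heavily, the same unit is nearest to centroid B, so the choice of weights can change cluster membership.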

For the geographic units of analysis, we used the smallest level of geography that still provided a statistically reliable sample of Census long-form questionnaires. In effect, this meant using a mix of Tracts in suburbs and rural areas and Block Groups in the urban core and urban fringe areas, for a total of 90,000 unique, non-duplicated geo-units. Once the centroids for each cluster were fixed (see below for a description of the process), these centroids were used as cookie cutters to stamp out equivalent clusters for all neighborhood levels: Zip, Tract, Block Group, Zip-4 and Carrier Route.
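The "cookie cutter" step amounts to nearest-centroid assignment: the centroids stay fixed and each unit at another level of geography is given the cluster whose centroid it sits closest to. A minimal sketch follows, assuming unweighted squared Euclidean distance; the unit identifiers, centroid values and function name are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def assign_to_fixed_centroids(
        units: Dict[Hashable, Sequence[float]],
        centroids: Dict[int, Sequence[float]]) -> Dict[Hashable, int]:
    """Give each unit the number of its nearest centroid; centroids never move."""
    assignments = {}
    for unit_id, profile in units.items():
        assignments[unit_id] = min(
            centroids,
            key=lambda cid: sum((p - c) ** 2
                                for p, c in zip(profile, centroids[cid])),
        )
    return assignments


# Centroids fixed by the original analysis (values invented for illustration).
fixed_centroids = {1: (1.0, 0.0), 2: (0.0, 1.0)}

# Units at other levels of geography, profiled on the same variables.
other_levels = {"zip4-A": (0.9, 0.2), "carrier-route-B": (0.1, 0.8)}

assigned = assign_to_fixed_centroids(other_levels, fixed_centroids)
print(assigned)
```

Because the centroids are not re-estimated, a cluster number means the same thing at every level of geography.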

Neighborhood classification systems work best when they incorporate variables from as many domains as possible (such as household income, mobility, house value, ethnicity, education, language, occupation, dwelling unit type, etc.). However, a complete set of variables would number over 1,000. Working with such a large amount of data causes a statistical problem known as collinearity. That is to say, many or most of the variables used in the clustering process would be measuring the same thing. Since successful statistical analysis depends on the explanatory variables being orthogonal, or uncorrelated with one another, you have to reduce the amount of data to be analyzed. For example, including 10 variables representing educational attainment may effectively bolster the arsenal of income-type variables, so that this (albeit important) dimension is given too much weight.
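A simple way to spot variables that are "measuring the same thing" is to compute pairwise Pearson correlations and flag the pairs above a cutoff. The sketch below is a generic screening step, not the method used for PSYTE; the variable names, the values and the 0.9 threshold are invented.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns: Dict[str, Sequence[float]],
                     threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """List every pair of variables whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented values for five geographic units.
columns = {
    "pct_college": [10, 20, 30, 40, 50],
    "median_income": [21, 39, 62, 80, 98],
    "pct_renters": [50, 10, 40, 20, 30],
}
flagged = correlated_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

In this toy data only the education and income variables are flagged, which mirrors the example in the text of education variables reinforcing the income dimension.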

Up to this point, there have generally been two approaches to reducing the amount of data to be analyzed. The first is to select, on an a priori basis, a relatively small number of variables (say 40 to 50) to represent each of the dimensions the analyst thinks are most important. The problem with doing this is that a lot of the subtlety hidden in the remaining 950 variables is lost. Most analysts are experienced enough to make this approach work fairly well, but it won't work in clustering the neighborhoods of the United States. The structure of this country, particularly in 1998, is too complex for such a simplistic approach.

The second approach is called principal components analysis, a statistical process that groups individual variables into separate components or factors and uses these, rather than individual variables, as a basis for measuring the similarities between areas. Along with reducing the amount of computer processing required, this is an excellent technique for removing the distortion caused by taking too many variables from one domain and not enough from another.
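As an illustration of how correlated variables collapse into one component, the sketch below extracts only the first principal component of a small data set using power iteration on the covariance matrix. A real analysis would use a statistics package and extract several components; the function name and the data are assumptions for the example.

```python
import math
from typing import List, Sequence, Tuple


def first_principal_component(
        rows: Sequence[Sequence[float]],
        iterations: int = 200) -> Tuple[List[float], float, List[float]]:
    """Return (direction, share of total variance explained, scores per row)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the variables.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    # Assumes the starting vector is not orthogonal to the leading direction.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    # Variance along the component, as a share of total variance.
    cov_vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
    eigenvalue = sum(v * cv for v, cv in zip(vec, cov_vec))
    total = sum(cov[i][i] for i in range(dims))
    share = eigenvalue / total if total else 0.0
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centered]
    return vec, share, scores


# Two invented variables that move together exactly (the second is twice the first).
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, share, scores = first_principal_component(rows)
print(direction, share)
```

Here a single component carries all of the variance of both variables, so using the component scores in place of the two original variables loses nothing. With real data the share would be lower and further components would be needed.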

However, the disadvantage of using principal components analysis is that it is less effective than using individual variables for building classification systems. The resulting classifications lack definition and will typically join together areas that, at a detailed level, have obvious key differences. In fact, the differences in the values of two otherwise highly correlated variables - the proportion of professional and managerial workers and the level of car ownership, for example - often help to pull apart areas with significant differences in customer behavior into separate clusters.

We are not going to detail how we dealt with some of these technical issues, except to say that we employed new non-linear statistical techniques to mitigate the problems associated with incorporating too many highly correlated variables. We were able to identify those variables that had the most explanatory power and, more importantly, determine what weight to assign to each variable.

The result was that we were able to use over four hundred variables to create PSYTE. This meant that, beyond the obvious socio-economic variables from the Census (traditionally, geodemographic clustering systems have been based entirely upon census variables), we were able to include dozens and dozens of variables (many of which are unique to Polk) that aren't usually used in a clustering exercise but which have made PSYTE really something special.

First, we used summarized measures of actual behavior at the Block Group and Tract levels, taken from our information-rich national consumer database of over 101,000,000 households (which is compiled from 36 individual sources). While these data added new dimensions to the classification process itself, they were also useful in helping us assign areas that had been built up since the census was taken eight years ago and for which there was obviously no census data. They will also be used to make future updates.

Secondly, we included a set of variables that were used to capture density and settlement patterns. Why bother? Imagine four neighborhoods with the same age of maintainer, income, family structure, housing type, ethnicity, etc. These neighborhoods are in downtown Chicago, urban-fringe Detroit, Nassau County on Long Island and a new sub-division outside of Austin.

Logic and experience teach us that these areas will have quite different lifestyles and buying habits despite their identical socio-economic and demographic profiles. In other words, the "geography" or the "where" of the demography is as important as the demography itself. We created approximately 30 variables, including local density, distance to nearest urban place, concentration of business by SIC, travel time to nearest urban core and so forth, to make sure that the "where" of the "who" was adequately captured in the clustering model. It is impossible to exaggerate the importance of these variables in the design of PSYTE.
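One of the settlement variables named above, distance to the nearest urban place, could be derived along the following lines. This sketch uses the haversine great-circle formula with approximate city-center coordinates; the list of places, the function names and the sample neighborhood location are assumptions, and these notes do not say how the PSYTE variables were actually computed.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_urban_place(lat: float, lon: float,
                        places: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return the name of the closest place and the distance to it in kilometers."""
    name = min(places, key=lambda n: haversine_km(lat, lon, *places[n]))
    return name, haversine_km(lat, lon, *places[name])


# Approximate city-center coordinates (illustrative, not a real gazetteer).
URBAN_PLACES = {
    "Chicago": (41.88, -87.63),
    "Detroit": (42.33, -83.05),
    "Austin": (30.27, -97.74),
}

# A hypothetical new sub-division north of Austin.
place, distance = nearest_urban_place(30.50, -97.80, URBAN_PLACES)
print(place, round(distance, 1))
```

A variable built this way separates a downtown block from an otherwise identical sub-division many kilometers out, which is the "where" that demographic variables alone cannot capture.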

As important as all of the above was the incredible processing speed provided by the DEC Alpha 8200. The fact is, the sheer brawn of this machine was as important to the success of PSYTE as the brain of our algorithms. We were able to evaluate the effectiveness of hundreds of classification solutions based on different combinations of variables and weights, each time testing how well those variables discriminated on key measures of consumer behavior. We were able to test a solution in several hours instead of the days and weeks it would have taken just a few years ago.
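The notes do not say which statistic was used to judge how well a solution discriminated on consumer behavior. One common choice, shown here purely as an assumption, is the share of a behavioral measure's variance that lies between clusters rather than within them (often called eta-squared). The data and names below are invented.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def between_cluster_share(values: Sequence[float],
                          labels: Sequence[Hashable]) -> float:
    """Between-cluster sum of squares divided by total sum of squares (0 to 1)."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # the measure does not vary at all
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    between_ss = sum(len(m) * (sum(m) / len(m) - grand_mean) ** 2
                     for m in groups.values())
    return between_ss / total_ss


# Hypothetical purchase rate for four geographic units.
purchase_rate = [1.0, 1.0, 9.0, 9.0]

# Two candidate classification solutions for the same units.
solution_a = [0, 0, 1, 1]   # separates low buyers from high buyers
solution_b = [0, 1, 0, 1]   # mixes them

share_a = between_cluster_share(purchase_rate, solution_a)
share_b = between_cluster_share(purchase_rate, solution_b)
print(share_a, share_b)
```

A score near 1 means the clusters separate buyers from non-buyers almost perfectly, and a score near 0 means cluster membership says nothing about the behavior. Comparing hundreds of candidate solutions comes down to repeating a calculation of this kind across many behavioral measures.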

There are many examples in PSYTE where powerful algorithms, fast computers, artificial intelligence and new approaches to measuring settlement patterns have changed geodemographic clustering forever. But, at the end of the day, we would have been untrue to our Polk heritage and negligent, considering how much actual data we had available to us, if we had not let actual consumer data rule. That is to say, in solution after solution, we saw clusters appear which were not as sharp and defined demographically as we would have liked, but for some reason their purchase of products and services was very distinctive. We tried to split them, change weights, do whatever we could to send them some place more orderly, more understandable to a linear human mind, more in keeping with the neighborhood solutions of other cluster systems, but they persisted in claiming a place. There are just so many fascinating examples of clusters in PSYTE which could not have been found using old techniques and simple census data. To name only a few: Country Manors, Execu-twins, Homebodies, Night Lights, Cross Roads, Church Fans... You will recognize these places when you read the descriptions, but these are not places that can be easily captured and described through simplistic demographic criteria. But they exist, and the people who live in them behave and live differently than people in other places.

People may ask, "Why a 65-cluster solution and not 40, 70 or 52?" The short answer is that we found, through solution after solution, that this number afforded the maximum amount of discrimination with the fewest number of clusters. But as any statistician will tell you, the final number is fairly arbitrary. If we had contented ourselves with simply producing a demographic construct, we probably could have done with 50 clusters, but the behavioral data forced us to create more positions to capture the incredible diversity of this country. But it wouldn't have been appropriate to go the other way either. You can always achieve greater discrimination with more or smaller clusters. But what may look good in the abstract will in practice always have sample size problems and be virtually useless for any real-world marketing applications. One of the principal design objectives of PSYTE was, in fact, to make sure that there were very few small and, for that matter, very few large clusters.
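The design objective of avoiding very small and very large clusters can be checked mechanically for any candidate solution. The sketch below reports each cluster's share of all units and flags those outside a chosen band; the thresholds, function name and data are hypothetical, and the notes do not state what limits were applied to PSYTE.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def size_report(labels: Sequence[Hashable], min_share: float,
                max_share: float) -> Dict[Hashable, Tuple[float, str]]:
    """Map each cluster to (share of all units, 'ok' / 'too small' / 'too large')."""
    counts = Counter(labels)
    total = len(labels)
    report = {}
    for cluster, count in sorted(counts.items()):
        share = count / total
        if share < min_share:
            status = "too small"
        elif share > max_share:
            status = "too large"
        else:
            status = "ok"
        report[cluster] = (share, status)
    return report


# Invented three-cluster solution over 100 geographic units.
solution = [1] * 50 + [2] * 45 + [3] * 5
report = size_report(solution, min_share=0.10, max_share=0.48)
for cluster, (share, status) in report.items():
    print(cluster, f"{share:.0%}", status)
```

A cluster flagged as too small is the kind that produces the sample size problems mentioned above, and one flagged as too large lumps together too many different neighborhoods to discriminate well.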

We've tried to communicate the basic content of each cluster through a numbering and naming convention. The 65 clusters have been numbered from 1 to 65 and ranked on a proprietary measure of discretionary income, with Cluster 1 being the wealthiest. Each cluster has been further assigned to one of 16 major groups, with each group prefixed by a letter - U, F, S, X or R - to indicate the approximate settlement pattern of the clusters. Here's the key:

Groups U1-U3: Urban Downtowns
Groups F1-F4: Urban Fringe Areas
Groups S1-S4: Greenbelt Suburbs
Groups X1-X3: Exurbs and Towns
Groups R1-R2: Rural/Farm Areas

A final seventeenth Group is coded "GQ" for Group Quarters, which contains two anomalous clusters dominated by military personnel living in barracks, and students in dormitories.
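For anyone handling the group codes in software, the key above translates directly into a small lookup. The group letters, ranges and labels come from the key; the function and constant names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Letter -> (settlement pattern, highest group number), taken from the key above.
GROUP_RANGES: Dict[str, Tuple[str, int]] = {
    "U": ("Urban Downtowns", 3),
    "F": ("Urban Fringe Areas", 4),
    "S": ("Greenbelt Suburbs", 4),
    "X": ("Exurbs and Towns", 3),
    "R": ("Rural/Farm Areas", 2),
}


def settlement_pattern(group_code: str) -> str:
    """Translate a group code such as 'U2' or 'GQ' into its settlement pattern."""
    code = group_code.strip().upper()
    if code == "GQ":
        return "Group Quarters"
    if len(code) != 2 or code[0] not in GROUP_RANGES or not code[1].isdigit():
        raise ValueError(f"unrecognized group code: {group_code!r}")
    label, highest = GROUP_RANGES[code[0]]
    if not 1 <= int(code[1]) <= highest:
        raise ValueError(f"group number out of range: {group_code!r}")
    return label


# Every valid group code: the 16 major groups plus Group Quarters.
ALL_GROUPS: List[str] = [
    f"{letter}{number}"
    for letter, (_, highest) in GROUP_RANGES.items()
    for number in range(1, highest + 1)
] + ["GQ"]

print(len(ALL_GROUPS), settlement_pattern("U2"), settlement_pattern("GQ"))
```

The letter gives only the approximate settlement pattern of a group. It does not replace the full cluster profile.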

While these naming and numbering conventions - or "image triggers" - facilitate understanding and communication, there are issues and problems with this approach. First, there is the issue of homogeneity as it applies to the clustering concept (as explained elsewhere in these notes). The group label does not mean, for example, that a cluster in Group U2 is necessarily all urban, although 90% of it will be. Secondly, there are problems with applying nicknames to individual clusters. On the one hand, there is always the possibility of offending some group. On the other hand, nicknames convey only univariate or, at best, bivariate types of characteristics. The worry is that people may think the nickname says it all and not bother to study and understand the multivariate demographic and behavioral profile of each cluster. Some of the differences between the clusters are quite subtle and need to be carefully studied. Still, nicknaming is a common device shared by most cluster systems and is a time-honored convention going back to the 1930s. It dramatically increases people's intuitive understanding of what the clusters are like.