Using Exploratory Data Analysis to Improve Choropleth Map Design

Choropleth maps are a widely used form of thematic mapping in which geographic areas are shaded with graduated colors to represent numeric values. Each area, typically a census unit or administrative boundary, carries a summary value for a single variable. The name comes from the Greek choros (area or region) and plethos (multitude), reflecting the purpose of the technique: visualizing how values vary across locations.
Despite its apparent simplicity, effective choropleth mapping depends heavily on informed decisions about data classification. This article explains how exploratory data analysis (EDA) can guide those decisions and help mapmakers choose classification schemes that accurately reflect underlying data patterns.
Why Exploratory Data Analysis Matters
Although most geographic data analysis involves some level of exploration, exploratory data analysis has a specific methodological meaning. Popularized in statistical practice by John Tukey in the 1970s, EDA focuses on understanding how data are distributed before applying formal models or visualization techniques.
EDA examines whether values follow recognizable statistical patterns—such as normality or uniformity—and whether extreme observations or outliers are present. It also emphasizes visual inspection using tools like histograms, boxplots, and quantile–quantile plots. Other EDA techniques, including variance testing or stem-and-leaf displays, can provide additional insight but are beyond the scope of this discussion.
For choropleth mapping, EDA should always come first. A map shows how values vary across space, while a histogram reveals how those same values are distributed statistically. These two perspectives are inseparable: the choice of classification method should be informed directly by the statistical shape of the data.
Statistical Distributions and Mapping Choices
Two idealized distributions are closely associated with two common choropleth classification strategies. When data approximate a normal distribution, classification based on standard deviation is often appropriate. When data are evenly spread across their range, equal interval classification is theoretically suitable.
In practice, truly uniform distributions are rare: real-world values seldom spread evenly across their entire range. A uniform distribution should also not be confused with a constant one. If every geographic unit has nearly the same value, a choropleth map provides little insight because all areas appear similar. Normal distributions, on the other hand, are common in many real-world variables and lend themselves well to interpretation using statistical thresholds around the mean.
Most datasets fall somewhere between these ideals. Many variables exhibit skewed or irregular distributions that do not conform to either pattern. Statistical tests for normality or uniformity can confirm this, and results often depend on factors such as sample size, study-area extent, or the treatment of missing and zero values. While formal tests offer precision, histograms remain valuable because they reveal the full range and structure of the data.
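As a minimal sketch of this kind of inspection, the Python snippet below counts values in equal-width histogram bins and computes a simple skewness measure. The `rates` values and both function names are hypothetical, chosen only for illustration; in practice the histogram would usually be plotted rather than printed.

```python
import statistics

def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[idx] += 1
    return counts

def skewness(values):
    """Mean cubed z-score: near 0 if symmetric, positive if the long tail is on the right."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical rates for twelve areal units (illustrative only).
rates = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.0, 4.5, 5.0, 6.0, 9.0, 14.0]
print(histogram_counts(rates, 4))
print(round(skewness(rates), 2))
```

A strongly lopsided set of bin counts together with a clearly positive skewness suggests that neither standard deviation nor equal interval classification is a natural fit for the raw values.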
Choosing Appropriate Data for Choropleth Maps
Data summarized by geographic areas fall into two broad categories: absolute totals and derived values. Totals include raw counts such as total population, while derived values include ratios like population density, percentages, or averages.
As a general rule, absolute totals should not be mapped using choropleth techniques unless the areas being compared are similar in size. Mapping raw counts across uneven areas can produce misleading results, exaggerating large regions and minimizing smaller ones.
Derived values correct this problem by normalizing data. Ratios remove the influence of area size, allowing meaningful comparison across space. Common examples include percentages, averages, and densities. Some ratios are independent of area, while others explicitly incorporate it, but all serve to standardize values so spatial patterns can be interpreted correctly.
Using normalized data often reveals patterns that absolute values obscure. Areas that appear similar when mapped using totals may fall into entirely different categories once relative measures are applied.
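The effect of normalization can be illustrated with a short sketch. The two areal units, their field names, and their values below are invented for the example: both have the same total population, yet their density and percentage values differ sharply.

```python
def density(total, area_km2):
    """Ratio that explicitly incorporates area (for example, people per square kilometre)."""
    return total / area_km2

def percentage(part, whole):
    """Ratio that is independent of area."""
    return 100.0 * part / whole

# Hypothetical areal units with identical totals (illustrative only).
units = [
    {"name": "A", "population": 50000, "area_km2": 500.0, "over_65": 5000},
    {"name": "B", "population": 50000, "area_km2": 25.0, "over_65": 10000},
]

for u in units:
    u["pop_density"] = density(u["population"], u["area_km2"])
    u["pct_over_65"] = percentage(u["over_65"], u["population"])
    print(u["name"], u["population"], u["pop_density"], u["pct_over_65"])
```

Mapped by total population, units A and B would share a class; mapped by density or by percentage, they would likely be separated.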
Evaluating Normality for Standard Deviation Classification
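Standard deviation classification relies on the statistical properties of normally distributed data. In such distributions, roughly two-thirds of values (about 68%) fall within one standard deviation of the mean, about 95% fall within two, and nearly all (about 99.7%) fall within three. Class breaks placed at fixed numbers of standard deviations from the mean therefore have a direct statistical interpretation.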
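One way to turn this into class breaks is sketched below. It places breaks at the mean and at whole standard deviations on either side; the function names, the choice of whole-standard-deviation steps, and the sample values are assumptions for illustration, and mapping software may use half-standard-deviation steps or other conventions.

```python
import bisect
import statistics

def std_dev_breaks(values, max_sd=2):
    """Class breaks at the mean and at whole standard deviations around it."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [mean + k * sd for k in range(-max_sd, max_sd + 1)]

def classify(value, breaks):
    """Index of the class containing `value` (0 = below the lowest break)."""
    return bisect.bisect_right(breaks, value)

def share_within(values, n_sd):
    """Fraction of values lying within `n_sd` standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= n_sd * sd for v in values) / len(values)

# Hypothetical average housing values in thousands (illustrative only).
housing = [180, 195, 200, 205, 210, 215, 220, 225, 240, 260]
breaks = std_dev_breaks(housing)
print([round(b, 1) for b in breaks])
print([classify(v, breaks) for v in housing])
print(share_within(housing, 1))
```

Comparing `share_within` against the expected shares for a normal distribution gives a quick, informal check on whether the classification is likely to be meaningful.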
Normality can be evaluated using statistical tests such as the Kolmogorov–Smirnov test, which compares observed values to theoretical expectations. Results may show that some variables—such as average housing value—closely follow a normal distribution, while others—such as educational attainment—do not.
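The core of the Kolmogorov–Smirnov approach is the largest gap between the empirical cumulative distribution and the theoretical one. The sketch below computes that statistic against a normal distribution fitted to the data, using only the Python standard library. The function names and synthetic samples are assumptions for illustration, and no p-value is computed.

```python
import math
import statistics

def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_normal(values):
    """KS statistic against a normal distribution fitted to the sample."""
    fitted = statistics.NormalDist(statistics.fmean(values), statistics.stdev(values))
    return ks_statistic(values, fitted.cdf)

# Synthetic samples (illustrative only): one bell-shaped, one right-skewed.
std_normal = statistics.NormalDist()
symmetric = [std_normal.inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(v) for v in symmetric]

print(round(ks_normal(symmetric), 3), round(ks_normal(skewed), 3))
```

For a fully specified reference distribution, a common large-sample rule of thumb rejects at the 5% level when the statistic exceeds about 1.36 divided by the square root of the sample size. When the mean and standard deviation are estimated from the same data, as here, that threshold is too lenient, and a corrected procedure such as the Lilliefors test is generally preferred.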
Transforming skewed variables, for example through logarithmic scaling, can sometimes restore normality. Visual tools reinforce these findings. Histograms reveal symmetry or skewness, while Q–Q plots compare observed data directly to expected normal values. When points align closely with the reference line in a Q–Q plot, the distribution is well suited to standard deviation classification.
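A Q–Q comparison can be sketched without plotting by pairing each standardized observation with the normal quantile expected at its rank and then measuring how straight the resulting line is. The plotting position `(i + 0.5) / n` is one common convention among several. The sample values and function names are hypothetical.

```python
import math
import statistics

def qq_points(values):
    """Pairs of (expected normal quantile, observed z-score), ordered by rank."""
    xs = sorted(values)
    n = len(xs)
    mean, sd = statistics.fmean(xs), statistics.stdev(xs)
    std_normal = statistics.NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), (x - mean) / sd)
            for i, x in enumerate(xs)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def qq_correlation(values):
    """Correlation of the Q-Q points; values near 1 indicate a nearly straight line."""
    expected, observed = zip(*qq_points(values))
    return pearson(expected, observed)

# Hypothetical right-skewed variable and its logarithmic transform.
skewed = [1, 2, 2, 3, 4, 6, 9, 15, 30, 80]
logged = [math.log(v) for v in skewed]

print(round(qq_correlation(skewed), 3), round(qq_correlation(logged), 3))
```

A higher correlation after the transform indicates that the logged values sit closer to the Q–Q reference line than the raw values do.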
Assessing Uniformity for Equal Interval Classification
Equal interval classification divides the full data range into classes of identical numeric width. This approach is appropriate only when data are distributed relatively evenly across that range.
Uniformity can be tested using the same statistical framework as normality testing, with the expected distribution changed from normal to uniform. In many cases, only a small subset of variables meets this criterion. When variables do meet it, equal interval maps are straightforward to interpret, particularly when geographic units are similar in size.
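Adjusting the expectation amounts to swapping in a uniform cumulative distribution. The following sketch measures the largest gap between the empirical distribution and a uniform one stretched over the observed range. Using the sample minimum and maximum as the bounds is an assumption made here for simplicity, and the function name is hypothetical.

```python
def ks_uniform(values):
    """Largest gap between the empirical CDF and a uniform CDF on [min, max]."""
    xs = sorted(values)
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    if hi == lo:
        raise ValueError("all values are identical; the range is empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = (x - lo) / (hi - lo)  # uniform CDF evaluated at x
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Synthetic examples (illustrative only).
evenly_spread = list(range(100))
bunched_low = [x ** 3 for x in range(100)]

print(round(ks_uniform(evenly_spread), 3), round(ks_uniform(bunched_low), 3))
```

A small statistic is consistent with values spread evenly over the range; a large one indicates clustering in part of it.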
However, equal interval classification can be problematic for skewed data. It may produce classes that contain very few—or no—areas, reducing the map’s analytical usefulness.
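The empty-class problem is easy to reproduce. The sketch below computes equal interval breaks and counts the units in each class for a hypothetical skewed variable; all names and values are illustrative.

```python
import bisect

def equal_interval_breaks(values, k):
    """Interior breaks dividing the range [min, max] into k classes of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def class_counts(values, breaks):
    """Number of values in each class; a value equal to a break joins the upper class."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect.bisect_right(breaks, v)] += 1
    return counts

# Hypothetical skewed variable: nine similar units and one very high value.
incomes = [10, 12, 13, 15, 16, 18, 20, 22, 25, 100]
breaks = equal_interval_breaks(incomes, 5)
print(breaks)
print(class_counts(incomes, breaks))
```

With these values, nine of the ten units land in the lowest class and three classes are empty, so the map would show almost no differentiation.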
Quantiles and Natural Breaks
Beyond standard deviation and equal interval methods, two additional classification strategies are commonly used: quantiles and natural breaks.
Quantile classification assigns an equal number of geographic units to each class, or a nearly equal number when the count does not divide evenly or values are tied. Depending on the number of classes, the breaks fall at the median, quartiles, quintiles, or other quantiles. While this guarantees balanced class sizes, it produces classes of unequal numeric width and can group areas with very different values into the same category.
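A minimal sketch of quantile classification, using the standard library `statistics.quantiles` function, is shown below. The `"inclusive"` interpolation method is one of several reasonable choices, and the sample values are hypothetical.

```python
import bisect
import statistics

def quantile_breaks(values, k):
    """Interior breaks that split the values into k classes of roughly equal count."""
    return statistics.quantiles(values, n=k, method="inclusive")

def class_members(values, breaks):
    """Group the values by class so class contents can be inspected."""
    classes = [[] for _ in range(len(breaks) + 1)]
    for v in sorted(values):
        classes[bisect.bisect_right(breaks, v)].append(v)
    return classes

# Hypothetical skewed variable with two very high values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 200]
breaks = quantile_breaks(values, 4)
for members in class_members(values, breaks):
    print(len(members), members)
```

Each class holds three units, but the top class spans values from 10 to 200, which illustrates how balanced counts can hide large within-class differences.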
Natural breaks classification also creates classes of variable width, but it determines breakpoints by identifying gaps in the data. The goal is to minimize variation within classes and maximize differences between them, an objective commonly implemented with the Jenks optimization method. This approach is particularly useful for irregular distributions.
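The optimization idea can be sketched with a small dynamic program that finds the contiguous grouping of sorted values with the lowest total within-class sum of squared deviations. This is an illustrative implementation in the spirit of optimal one-dimensional classification, not a reproduction of any particular GIS package's routine. The function name and sample values are assumptions, and the running time grows with the square of the number of units.

```python
def natural_breaks(values, k):
    """Split sorted values into k contiguous classes minimizing within-class squared deviation."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums allow the squared deviation of xs[i:j] to be computed in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: lowest total squared deviation when xs[:j] is split into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    split = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk the stored split points backwards to recover the classes.
    classes, j = [], n
    for c in range(k, 0, -1):
        i = split[c][j]
        classes.append(xs[i:j])
        j = i
    classes.reverse()
    return classes

# Hypothetical values with two visible gaps.
print(natural_breaks([1, 2, 3, 10, 11, 12, 50, 52], 3))
```

The upper value of each returned class can serve as the class break on the map legend.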
When variables do not exhibit normal or uniform behavior, both quantiles and natural breaks may be reasonable options. Visual inspection of histograms and maps may not clearly favor one over the other, but methods that optimize internal homogeneity often provide more meaningful results.
Comparing Classification Outcomes
Evaluating classification schemes side by side can reveal how much mapping choices influence interpretation. One approach is to map the same variable using multiple classification methods and then compare how geographic units are assigned across classes.
Statistical techniques such as cross-tabulation and chi-square tests can assess whether classifications are related, while kappa statistics measure the degree of agreement. High agreement suggests that different methods produce similar spatial patterns, while lower agreement indicates substantial variation.
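The comparison can be sketched as follows: build a cross-tabulation of the class assigned to each unit by two methods, then compute the chi-square statistic and Cohen's kappa from it. The class assignments below are invented for illustration, and the function names are hypothetical. No p-value is computed, and with counts this small the chi-square statistic is shown only to illustrate the arithmetic.

```python
def cross_tab(a, b, k):
    """k-by-k table: rows are classes from method a, columns are classes from method b."""
    table = [[0] * k for _ in range(k)]
    for i, j in zip(a, b):
        table[i][j] += 1
    return table

def margins(table):
    """Row totals, column totals, and grand total of a table."""
    rows = [sum(r) for r in table]
    cols = [sum(col) for col in zip(*table)]
    return rows, cols, sum(rows)

def chi_square(table):
    """Pearson chi-square statistic for independence of rows and columns."""
    rows, cols, n = margins(table)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def cohens_kappa(table):
    """Agreement beyond chance: 1 is perfect agreement, 0 is chance-level."""
    rows, cols, n = margins(table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    expected = sum(r * c for r, c in zip(rows, cols)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical class indices for ten units under two classification methods.
method_a = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
method_b = [0, 0, 1, 2, 2, 2, 0, 1, 1, 2]
table = cross_tab(method_a, method_b, 3)
print(table)
print(round(chi_square(table), 3), round(cohens_kappa(table), 3))
```

To weight agreement by area coverage rather than unit counts, each unit's area could be added to the table cell in place of 1.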
Analyses based on both unit counts and area coverage often show that certain methods—particularly standard deviation and natural breaks—yield comparable results when applied to normally distributed data. Other methods may diverge significantly.
Identifying Outliers and Extreme Values
Outliers and extreme observations represent departures from typical data behavior and deserve careful attention. They may indicate data errors, unique local conditions, or meaningful anomalies. In some cases, they can be isolated into separate categories.
Boxplots are a central EDA tool for detecting these features. To compare variables with different scales, values can be standardized into z-scores, placing most observations within a common range.
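Standardization is a one-line calculation per value. The sketch below converts two hypothetical variables measured on very different scales to z-scores so they can share a boxplot axis; the variable names and values are illustrative.

```python
import statistics

def z_scores(values):
    """Express each value as its distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical variables on different scales (illustrative only).
housing_value = [180000, 195000, 210000, 225000, 260000]
pct_graduates = [12.0, 18.5, 22.0, 25.5, 41.0]

print([round(z, 2) for z in z_scores(housing_value)])
print([round(z, 2) for z in z_scores(pct_graduates)])
```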
A boxplot displays the middle half of the data, bounded by the first and third quartiles, with the median marked inside. Points lying beyond defined thresholds, commonly 1.5 times the interquartile range past either quartile, are identified as outliers. Values beyond a wider threshold, commonly three times the interquartile range, are classified as extremes; they are less common but may significantly affect distribution shape.
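The boxplot rules can be sketched numerically. The example below assumes the widely used convention of fences at 1.5 and 3 times the interquartile range beyond the quartiles; the multipliers, the quartile interpolation method, the function name, and the sample values are all assumptions for illustration, and software packages differ in these details.

```python
import statistics

def boxplot_summary(values, inner=1.5, outer=3.0):
    """Quartiles plus values flagged as outliers (past inner fences) or extremes (past outer fences)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_inner, high_inner = q1 - inner * iqr, q3 + inner * iqr
    low_outer, high_outer = q1 - outer * iqr, q3 + outer * iqr
    extremes = [v for v in values if v < low_outer or v > high_outer]
    outliers = [v for v in values
                if (low_outer <= v < low_inner) or (high_inner < v <= high_outer)]
    return {"q1": q1, "median": median, "q3": q3,
            "outliers": outliers, "extremes": extremes}

# Hypothetical values with one moderate and one very distant observation.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 21, 30, 60]
print(boxplot_summary(values))
```

Units flagged this way are candidates for a closer look, and possibly for a separate map class, before the classification is finalized.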
Removing or reclassifying outliers can dramatically alter statistical properties and, by extension, mapping results. For this reason, their presence should always be examined and justified before finalizing a choropleth map.