Processing and Analyzing Geographical Data

What is knowledge discovery?
GIS and knowledge discovery (KD), also known as data mining (DM), are considered by many not only as technologies but also as sciences or even "arts." KD helps in detecting patterns and extracting significant, previously unknown information from databases. For many years, statisticians manually mined databases looking for statistically significant patterns. This operation can now be performed more or less automatically.

KD overlaps with predictive analytics since it is also a business intelligence tool for predicting future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This predictive information can be easily overlooked or underestimated even by experts. Although the broad meaning of knowledge discovery refers more to traditional statistical methods, its narrow definition emphasizes such issues as automated methods, artificial intelligence, or machine learning techniques.

As technologies, both GIS and KD emerged about 15 years ago (Ed note: as mainstream applications) and their origins were stimulated by progress in computer technology, such as employing computer graphics and dealing with massive databases. KD usually deals with a large number of attributes, whereas GIS deals with a large number of GIS features (records). As a science, KD is a part of applied mathematics or statistics. Also, practitioners of both GIS and KD tend to be interdisciplinary; they have developed their own specific methods and specialized tools, and have attempted to construct their own methodologies. Finally, KD and GIS can be considered "arts" because they require some level of technical proficiency and competence in the application domain area.

Emerging GIS and knowledge discovery
As a computer technology, GIS is characterized by the heavy use of algorithms representing computational geometry (such as the polygon intersection algorithm) and topological operations. GIS also deals with relatively large and complex objects such as polygons with high fractal dimensions, polygons with attached topological information, networks with their attributes (for example, addresses), and large index tables. GIS utilizes spatial data structures and corresponding algorithms for storing and indexing spatial data. GIS is a synergetic technology because it represents much more than just the sum of its components. This synergy can be even more obvious because GIS, being itself a very powerful technology, benefits from integration with other technologies such as KD, customer relationship management (CRM), or enterprise resource planning (ERP).

Both GIS and KD technologies emerged partially as a result of the abundance of data and the inefficiency of traditional technology in processing information. For both, the progress in computer technology was critical, including advancement in data structures, database management, computer graphics and artificial intelligence. Another key factor was the interdisciplinary nature of GIS and KD. In early stages, the main contributors to GIS were geographers, computer scientists, foresters, land surveyors and military personnel. In KD, the main contributors were statisticians, computer scientists, marketers, quality controllers and medical specialists.

The early developmental stage of these two technologies (1970s) was focused on data collection with retrospective and static data delivery. Enabling technologies were mainframe computers or digitizing tables. In GIS, an example of a typical question was "What is the forest stand type in a given polygon?" whereas in knowledge discovery, a typical question could be "What was the total revenue in the last three years?"

The next stage in developing these technologies (1980s) focused on data access. GIS could answer questions like: "Where is the most suitable moose habitat?" providing retrospective and dynamic data delivery at a feature level. The enabling technological issues were vector topology, raster data structure and database management systems. The major applications were found in geology, environmental sciences and in the government. In KD, a question like: "What were unit sales in the Maritimes last April?" could be answered using the retrospective and dynamic data delivery at a record level and such enabling technologies as relational database management systems, structured query language (SQL), or open database connectivity.

In the 1990s, the focus in GIS was data modeling and analysis. Such questions as "What are the changes in the forest cover in a given area?" could be answered using the retrospective and dynamic data delivery at multiple levels. The enabling technological issues were vector/raster integration, GPS, SQL, interoperability and portable computers. The major users were corporations, municipalities and educational institutions. KD was focused on data warehousing and decision support. Questions like "What were unit sales in the Maritimes last April? Drill down to Halifax, Nova Scotia," could be answered using the retrospective and dynamic data delivery at multiple levels. The enabling technologies were online analytical processing (OLAP), data warehouses and portable computers.

Today, GIS is focused on the deployment of geographical information by answering such questions as: "How to get to the closest restaurant?" Data delivery is proactive and prospective; enabling technological issues include location-based services, Internet mapping, and geodatabases. The major users come from communication, business or the general public. KD represents data mining with such typical questions as: "What is likely to happen to Halifax unit sales next month and why?" Data delivery is also prospective and proactive. The enabling technologies include distributed algorithms and databases, multiprocessor computers and massive databases. Further progress in knowledge discovery may result from developing query languages for spatial knowledge discovery, mining under uncertainty, and using parallel knowledge discovery (Koperski, 1997).

Both GIS and KD deal with massive databases. Can they be used for handling the problem of getting too much information? As discussed by Kantardzic (2003), 61% of managers believe that information overload is present in their own workplace; 80% of them believe the situation is getting worse; over 50% of managers ignore data in current decision-making processes because of information overload; 84% of managers do not use this information immediately but store it for future use; and 60% believe that the cost of gathering information outweighs its value.

The list of applications in GIS and KD is very extensive, and it is impossible to find "the most typical" one. Therefore, both technologies can be considered domain-free; they can be applied in practically any domain. GIS and KD are also scale-free technologies, since they can be applied at many different scales. There are examples of using GIS for mapping a human eye and for analyzing changes at a global or even cosmic scale. Similarly, KD is used at a micro-scale level (for diagnosing a single patient) and at a macro-scale level (for international analyses).

Mining geographical information
The most generic components of GIS are:
1. Data input
2. Data manipulation
3. Analysis and modeling
4. Data output.

There are similarities between these components and the KD phases identified within the Cross-Industry Standard Process for Data Mining (CRISP-DM, referred to here as CRISP) methodology. CRISP is a general KD protocol developed in the late 1990s and is similar to the product life cycle methodology developed in software engineering and implemented in managing GIS projects. The CRISP protocol consists of six phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment.

According to the CRISP protocol, the business understanding phase comprises four tasks: determining business objectives (defining the background, the objectives themselves and the business success criteria); assessing the situation (making an inventory of resources and requirements, and analyzing assumptions and constraints, risks and contingencies, and costs and benefits); determining knowledge discovery goals and success criteria; and producing a project plan, including an assessment of tools and techniques.

Regardless of these similarities, the nature of KD and GIS leads to some substantial differences between these technologies. KD operates in multidimensional abstract space, whereas GIS acts mainly in geographical space. Hypotheses in KD are generated by machine learning, while in GIS hypotheses are constructed by users. Results of analysis in KD often go beyond the content of a database. In GIS, there are difficulties in mapping multivariate dependencies.

Integrating GIS and KD
GIS and KD are both synergetic, powerful, dynamic and rapidly developing technologies. There are numerous areas where GIS and KD have already overlapped; the process of integrating GIS and KD has been initiated. However, further integration can significantly benefit both technologies.

Benefits to GIS of Integrating with KD
GIS can benefit from being integrated with KD by using more efficient data manipulation tools, specialized exploratory data analysis (EDA) tools, powerful new modeling tools and better visualization tools.

Data manipulation tools represent the primary area within KD. These tools are also important, but not critical, in GIS, since the manipulation of non-spatial attributes can always be performed outside GIS. Experts agree that data cleansing is one of the most time-consuming and costly operations within GIS projects. It would be very beneficial if GIS could incorporate more sophisticated data manipulation tools for such common operations as detecting and replacing missing data, improving attribute accuracy, handling inconsistency in databases, intelligent data reclassification, merging attributes and appending records, and filtering data.

EDA tools were introduced to GIS about 10-15 years ago directly from KD. Since then, EDA has been used as the very first step in any spatial analysis completed with GIS. KD provides more specialized EDA tools for such operations as outlier analysis, testing normality, and analyzing distributions with boxplots and Q-Q plots.
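As an illustration of the "detecting and replacing missing data" operation mentioned above, here is a minimal Python sketch. The record layout, the `height_m` field, the `-9999` no-data code and the choice of the median as the replacement value are all assumptions made for the example; they are not taken from any particular GIS or KD package.

```python
from statistics import median

def fill_missing(records, field, missing=(None, "", -9999)):
    """Return copies of the records with missing values in `field`
    replaced by the median of the observed values (an assumed strategy)."""
    observed = [r[field] for r in records if r[field] not in missing]
    fill = median(observed)
    return [
        dict(r, **{field: fill}) if r[field] in missing else dict(r)
        for r in records
    ]

# Hypothetical forest-stand attribute table with two missing heights.
stands = [
    {"id": 1, "height_m": 18.0},
    {"id": 2, "height_m": None},
    {"id": 3, "height_m": 22.0},
    {"id": 4, "height_m": -9999},   # assumed no-data code
    {"id": 5, "height_m": 20.0},
]
cleaned = fill_missing(stands, "height_m")
```

The original table is left untouched, so the cleaned attributes can be joined back to the spatial features by `id` once the result has been reviewed.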

KD also offers numerous powerful modeling tools that are not yet available in GIS, such as decision trees and decision rules, association rules, artificial neural networks and genetic algorithms. Some KD tools are already partially implemented in some GIS packages, including fuzzy logic or clustering.

Visualization tools play a critical role in mapping spatial attributes and enabling the art of cartography in GIS. These tools also play a very important role in KD, primarily focusing on charting and graphing with statistical methods.

Benefits to KD of Integrating with GIS
KD can benefit from being integrated with GIS at various stages of its own CRISP methodology, particularly in data preparation, analysis, evaluation and deployment. Data preparation represents a critical component in both KD and GIS. Geographically referenced attributes are very common within databases being analyzed using KD. However, when using KD technology alone, many typical operations on spatial attributes cannot be performed at all. GIS can provide tools for such operations as spatial referencing, geocoding or building topological relationships among objects. GIS can also be very useful in expanding the number of attributes available for further analysis by deriving new ones. New attributes can be derived based on geographical (metric) information or based on topological information. Newly derived geographical (metric) attributes include: length of lines, areas of polygons, distance to the closest object, directions, or density of features per area unit. Derived topological attributes include the connectivity of nodes, adjacency of polygons, and information resulting from such topological operations as inside, within, intersects, contains, covers and others.
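The metric attributes listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: coordinates are planar (already projected), polygons are simple rings without holes, and the function names are invented for the example rather than taken from a GIS library.

```python
import math

def line_length(points):
    """Length of a polyline given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def polygon_area(ring):
    """Area of a simple polygon (shoelace formula).
    `ring` lists the vertices once; the closing edge is implied."""
    n = len(ring)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def distance_to_closest(point, others):
    """Straight-line distance from `point` to the nearest of `others`."""
    return min(math.dist(point, other) for other in others)

# Derived attributes that could be appended to a KD input table.
road_length = line_length([(0, 0), (3, 4), (3, 10)])             # 11.0
parcel_area = polygon_area([(0, 0), (4, 0), (4, 3), (0, 3)])     # 12.0
nearest_store = distance_to_closest((0, 0), [(3, 4), (6, 8)])    # 5.0
```

Each value becomes one more column in the table passed to the KD modeling step, which is how GIS expands the set of predictors available to KD.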

Modeling and analysis are the most powerful components in both KD and GIS, and the technologies are complementary in their approach to modeling. GIS provides more specialized spatial analysis tools, whereas KD provides more statistical analysis tools. KD lacks numerous geographical analytical tools from the domain of GIS, including spatial statistics tools (e.g., the spatial multiple linear regression), spatial analysis tools (e.g., the spatial autocorrelation), geostatistical tools (e.g., kriging or trend surface analysis), network analysis tools (e.g., the optimal path or minimal tour), surface analysis tools (e.g., the visibility analysis), numerous location-allocation modeling tools (e.g., allocating demand to a given center), and regionalization tools (e.g., spatial clustering).

Evaluation is a required step in the KD protocol, whereas in GIS an evaluation is a recommended step rather than a strictly enforced standard. However, GIS itself offers invaluable evaluation tools for mapping residuals (the difference between actual and predicted values) or analyzing the spatial autocorrelation of residuals.
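To make the idea of analyzing the spatial autocorrelation of residuals concrete, the sketch below computes residuals and then Moran's I, one common autocorrelation measure. The choice of Moran's I, the binary adjacency weights and the four-region example values are assumptions made for illustration; the article does not prescribe a particular statistic.

```python
def morans_i(values, weights):
    """Moran's I for `values` with a square spatial weights matrix.
    weights[i][j] > 0 means regions i and j are neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

# Hypothetical model output for four regions arranged in a row.
actual = [10, 12, 20, 22]
predicted = [12, 13, 18, 19]
residuals = [a - p for a, p in zip(actual, predicted)]  # [-2, -1, 2, 3]

# Binary adjacency: each region neighbors the one next to it.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
i_value = morans_i(residuals, adjacency)
```

A clearly positive value, as in this made-up case, suggests that over- and under-predictions cluster in space, which usually means the model is missing a spatially structured predictor.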

Finally, in regard to the deployment phase, GIS provides mapping tools that are non-existent in standard KD. These tools, used for mapping results, can enhance the deployment phase in KD.

Enhancing geographical analysis with KD modeling tools
There are three basic groups of standard modeling tools provided by knowledge discovery: predictive, rule-based and classification tools. Predictive tools usually include neural networks, multiple linear regression, logistic regression, and C5.0 rule-based (for categorical target variables and categorical or numerical predictors) methods. The rule-based tools consist of the same C5.0 algorithm, classification and regression trees, association rules, Apriori (for categorical target variables and predictors), and generalized rule induction (GRI) algorithms. Finally, the classification tools include such algorithms as K-Means clustering, Kohonen network and two-step clustering. The purposes and results of these modeling tools, as well as their usefulness for geographical analysis, will be discussed below.

The purpose of neural network modeling is to predict a numeric or categorical target variable. The output includes predicted values, residuals (actual minus predicted values), and corresponding rules. With GIS, the actual target variable, its predicted values, residuals (Figure 1), and rules can be mapped and interpreted.

The purpose of rule induction modeling using the C5.0 algorithm is to predict a categorical target variable. The importance of predictors, predicted values and residuals constitute the output. The maps of actual target and predicted target variables and residuals can be created and analyzed within GIS.

Multiple linear regression is used for predicting a numerical target variable using numerical predictors. The output from the regression includes the selected set of predictors, predicted target variable and residuals. The maps of the actual target variable, the predicted target variable and residuals cannot be produced and analyzed within standard KD alone, so the use of GIS technology can be beneficial. The difference between this tool and logistic regression is that the latter can predict a categorical target variable using categorical and numerical predictors. The output and possible maps for logistic regression are similar to those from multiple linear regression (Figure 2).

Figure 1 - Predicting GDP per capita with neural network: absolute residuals

Figure 2 - Predicting GDP per capita with logistic regression (actual vs. predicted values)

The purpose of generating rules within KD is to better understand the analyzed data by finding patterns and rules governing them. The basic algorithms are C5.0 (for categorical target variables and categorical or numerical predictors), Apriori (for categorical target variables and predictors) and GRI (for categorical target variables and categorical or numerical predictors). The output consists of rules for groups of records, including their frequency and accuracy. The geographical distribution of rules can be mapped and analyzed with GIS.
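The "frequency and accuracy" reported for each rule correspond to what the association-rule literature usually calls support and confidence. The Python sketch below computes both for a single candidate rule. It is not an implementation of Apriori, C5.0 or GRI, only of the two measures such algorithms report, and the region records are invented for the example.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.
    Each record is a set of categorical attribute values."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [r for r in records if antecedent <= set(r)]
    satisfied = [r for r in matching if consequent <= set(r)]
    support = len(satisfied) / len(records)           # rule frequency
    confidence = len(satisfied) / len(matching) if matching else 0.0
    return support, confidence                        # confidence = accuracy

# Hypothetical categorical attributes for five regions.
regions = [
    {"coastal", "high_density", "high_sales"},
    {"coastal", "high_density", "high_sales"},
    {"coastal", "low_density", "low_sales"},
    {"inland", "high_density", "high_sales"},
    {"inland", "low_density", "low_sales"},
]
support, confidence = rule_stats(regions, {"high_density"}, {"high_sales"})
coastal_support, coastal_confidence = rule_stats(regions, {"coastal"}, {"high_sales"})
```

Because each record here stands for a region, flagging the regions that satisfy a rule gives exactly the attribute needed to map the rule's geographical distribution in GIS.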

The purpose of clustering is to group records into clusters using some of the available algorithms such as Kohonen networks, K-Means, or two-step clustering. The output includes the cluster memberships, cluster description, and, for the K-Means algorithm, the distance to cluster centroids. At least two types of maps can be created to show the geographical distribution of clustering: maps of clusters (Figure 3 showing cluster memberships) and maps of the most typical features for each cluster.
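A bare-bones version of the K-Means procedure, returning the three outputs named above (cluster memberships, centroids as a cluster description, and distance to the centroid), might look like the following Python sketch. The fixed starting centroids, the fixed iteration count and the two-attribute sample data are assumptions made to keep the example deterministic; production KD packages use more careful initialization and stopping rules.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def k_means(points, initial_centroids, iterations=10):
    """Plain K-Means: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    labels = assign(points, centroids)
    distances = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    return labels, centroids, distances

# Hypothetical standardized attributes for six countries.
countries = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
             (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids, distances = k_means(countries, [(0.0, 0.0), (10.0, 10.0)])
```

The `labels` list is what a cluster map such as Figure 3 displays, while `distances` can be mapped to show how typical each feature is of its cluster.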

Figure 3 - Clusters of countries (K-means algorithm)

Factor analysis and principal component analysis are used to reduce the number of variables by replacing individual variables with factors or components. This method produces a list of extracted factors or components, values of correlations between variables and factors or components, and factor/component scores. In GIS, the analysis of maps showing the geographical distribution of factor/component scores can provide new and very valuable information that is not available within KD alone.

Classification tree modeling is another standard modeling tool used in KD for picking individual predictors one at a time and classifying them in order to optimize (minimize or maximize) a predicted value of a target variable. This tool utilizes one of many possible algorithms, including the classification and regression tree, Chi-square automatic interaction detector (CHAID), exhaustive CHAID, or QUEST (quick, unbiased, efficient statistical tree). Modeling with the classification tree method provides the list of top predictors, and groups of similar records following the same classification rule. In GIS, the spatial distribution of rules can be analyzed and mapped (Figures 4 and 5).

Figure 4 - Classification tree

Finally, OLAP cubes represent another standard analytical tool in KD. OLAP cubes are used for querying, browsing and summarizing tabular information in a very efficient, interactive and dynamic way. The basic operations with OLAP cubes include slicing, dicing, rolling up and drilling down, and pivoting. The issue of integrating OLAP cubes with GIS was discussed in my article titled "Creating and Manipulating Multidimensional Tables with Locational Data Using OLAP Cubes".
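Slicing, rolling up and drilling down can be imitated on a small fact table in plain Python, as sketched below. This is a toy illustration rather than how an OLAP server is implemented; the function names, the dimension names and the unit-sales figures are all invented for the example.

```python
from collections import defaultdict

def slice_cube(facts, **fixed):
    """Slice: keep only rows whose dimensions match the fixed values."""
    return [row for row in facts if all(row[d] == v for d, v in fixed.items())]

def roll_up(facts, dimensions, measure):
    """Aggregate `measure` over the listed dimensions.
    Fewer dimensions = roll up; more dimensions = drill down."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Hypothetical unit-sales fact table with a geographic hierarchy.
sales = [
    {"region": "Maritimes", "city": "Halifax", "month": "April", "units": 120},
    {"region": "Maritimes", "city": "Moncton", "month": "April", "units": 80},
    {"region": "Maritimes", "city": "Halifax", "month": "May", "units": 150},
    {"region": "Ontario", "city": "Toronto", "month": "April", "units": 400},
]

# "What were unit sales last April?" summarized by region.
april_by_region = roll_up(slice_cube(sales, month="April"), ["region"], "units")

# Drill down from the Maritimes to individual cities.
april_by_city = roll_up(
    slice_cube(sales, month="April", region="Maritimes"),
    ["region", "city"],
    "units",
)
```

Because the region and city dimensions are geographic, each aggregated key can be joined to a map layer at the matching level of the hierarchy.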

Figure 5 - Map corresponding to Figure 4 classification tree

Spatial knowledge discovery resources
Numerous efforts have been made to integrate GIS and KD. Significant attempts at developing spatial KD have taken place at such American universities as the University of Utah, Southern Illinois University and Boston University. Other research centers where similar research has been conducted include Simon Fraser University (Canada), the University of Leeds (England), the University of Munich (Germany), the University of Bari (Italy) and the Russian Academy of Sciences. Spatial KD software packages, including GeoMiner and Spin!, were also developed. GeoMiner is a prototype of a spatial KD system, based on a spatial database server. Spin! (short for Spatial Mining for Data of Public Interest) represents a Web-based integration of KD and GIS for applications in such fields as public health, environmental protection, seismology or marketing. This European product includes live Oracle-based queries and data visualization.

Final remarks
Today, GIS and KD are still used as separate technologies. If someone is using both, and both software packages are driven by the same operating system, data can be passed easily (but still indirectly) between them. The idea of interoperability, developed in GIS in recent years, should be extended beyond GIS technology in order to establish the link with other business intelligence technologies such as KD, CRM or ERP. Right now, the most typical sequence of operations encountered while using GIS and KD tools is a mixture of both, as shown below.

1. Data preparation including data cleansing (KD)
2. Deriving new geographical attributes (GIS)
3. Spatial analysis (GIS)
4. Modeling (KD)
5. Validation (KD)
6. Mapping initial results and spatial validation (GIS)
7. Charting and interpreting results (KD)
8. Mapping final results (GIS)

Further integration of GIS and KD should focus on using spatial object-oriented and spatiotemporal databases, creating multidimensional spatial rules, integrating artificial intelligence and GIS, and spatial clustering. As an emerging discipline, spatial KD should also include visualization with multivariate thematic maps, mining remote sensing data, and maintaining the consistency and quality of spatial databases (topological and geometric errors).

Selected bibliography

1. CRISP-DM 1.0, 1999. SPSS.
2. Dramowicz K., 2002. Adding Geography to Data Mining. Data Mining Summit, Reston, VA.
3. Dramowicz K., 2005. Creating and Manipulating Multidimensional Tables with Locational Data Using OLAP Cubes. Directions Magazine, January 15, 2005. http://www.directionsmag.com/article.php?article_id=733
4. Dramowicz K., 2005. Geographic Dimension in Data Mining. ESRI Business GeoInfo Summit, April 18-19, Chicago, Illinois.
5. Dunham M.H., 2003. Data Mining: Introductory and Advanced Topics. Prentice Hall.
6. Eklund P.W. et al., 1998. Data Mining and Soil Salinity Analysis. International Journal of Geographical Information Science, 12, pp. 247-268.
7. Ester M. et al., 1998. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data Mining and Knowledge Discovery, 4, 2/3, pp. 193-216.
8. Gahegan M., 2000. On the Application of Inductive Machine Learning Tools to Geographical Analysis. Geographical Analysis, 2, pp. 113-139.
9. Kantardzic M., 2003. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley.
10. Koperski K. et al., 1997. Spatial Data Mining: Progress and Challenge.
11. Koperski K., J. Han, 1995. Discovery of Spatial Association Rules in Geographic Information Databases. [In:] Egenhofer M., J. Herring (eds.) Advances in Spatial Databases. Springer-Verlag, pp. 47-66.
12. Miller H.J. and J. Han (eds.), 2001. Geographic Data Mining and Knowledge Discovery. Taylor and Francis.
13. Openshaw S., 1999. Geographic Data Mining: Key Design Issues. 4th International Conference on GeoComputation.