LISA
Week 4
Data Preparation for Geovisualization
4.1 Introduction
Before any map or interactive visualization can be created, geospatial data must be properly
prepared. This process, though often overlooked, is perhaps the most essential stage in any
geovisualization project. Effective data preparation ensures that the final visual output is accurate,
clear, and reliable. It involves transforming raw data into a clean, structured, and meaningful form
that can be appropriately symbolized and analyzed.
Poorly prepared data can lead to misleading results, visual clutter, or technical errors that reduce
the effectiveness of a map. Therefore, this week’s module focuses on the critical tasks involved in
preparing data for visualization, including data cleaning, geocoding, format conversion, coordinate
system alignment, attribute classification, and dataset integration.
4.2 Understanding Spatial and Attribute Data
Geospatial data comprises two key components: spatial data and attribute data. Spatial data
represents the location and geometry of features on Earth’s surface. This can be in vector format—
points, lines, and polygons—or in raster format, where information is stored in a grid of cells.
Attribute data, on the other hand, provides descriptive information about those features, such as
population, land use type, or rainfall.
Proper visualization depends on both types of data being accurately linked and appropriately
formatted. A well-constructed geovisualization communicates both the where (spatial data) and the
what (attribute data).
4.3 Data Cleaning and Quality Control
The first and most important step in data preparation is data cleaning. Raw datasets often contain
errors, inconsistencies, duplicates, missing values, and outliers that can distort visualizations if left
uncorrected.
Cleaning spatial data may involve:
• Removing or correcting invalid geometries (e.g., self-intersecting polygons, unclosed rings)
• Merging multipart features into single features
• Ensuring topological integrity (e.g., avoiding dangling lines, and gaps or overlaps between adjacent polygons)
Cleaning attribute data involves:
• Correcting spelling mistakes and inconsistent naming conventions
• Replacing or removing null or missing values
• Converting text fields into numeric formats (or vice versa)
• Standardizing units of measurement (e.g., meters vs. kilometers)
Page 1 of 4 L.L. Yevugah
These steps may be performed using spreadsheet tools (e.g., Microsoft Excel), database software
(e.g., PostgreSQL with PostGIS), or GIS platforms like QGIS and ArcGIS, which provide tools
for both spatial and attribute validation.
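The attribute-cleaning steps listed above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the records, the field names (district, population, road_km), and the rule that a trailing "m" means meters are all assumptions made for the example, not part of any real dataset.

```python
import re

# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"district": " asunafo  north ", "population": "12,500", "road_km": "3.2"},
    {"district": "Asunafo North", "population": "12,500", "road_km": "3.2"},
    {"district": "TANO SOUTH", "population": "", "road_km": "4100 m"},
]


def clean_name(name):
    """Trim, collapse repeated spaces, and apply consistent title case."""
    return re.sub(r"\s+", " ", name).strip().title()


def to_int(text):
    """Convert a text count such as '12,500' to int; return None if missing."""
    text = text.replace(",", "").strip()
    return int(text) if text else None


def to_km(text):
    """Standardize a length to kilometers; a trailing 'm' is assumed to mean meters."""
    text = text.strip().lower()
    if text.endswith("km"):
        return float(text[:-2])
    if text.endswith("m"):
        return float(text[:-1]) / 1000.0
    return float(text)  # assumed to be kilometers already


def clean(records):
    """Normalize every record and drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            "district": clean_name(rec["district"]),
            "population": to_int(rec["population"]),
            "road_km": to_km(rec["road_km"]),
        }
        key = tuple(row.values())
        if key in seen:  # duplicate once names and numbers are normalized
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned = clean(raw_records)
```

Here the first two records turn out to be duplicates once the names are normalized, and the missing population is kept as None rather than replaced with 0, so it cannot be mistaken for a real value on the map.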
4.4 Data Classification for Thematic Mapping
Once data is clean, it must be classified into groups or categories that can be represented visually.
This is especially important for thematic visualizations, such as choropleth maps, where data
values are grouped into classes and each class is represented by a color or shade.
There are several common classification methods:
• Equal Interval divides the range of data into equal-sized segments.
• Quantile classification places an equal number of features in each class.
• Natural Breaks (Jenks) identifies "break points" in the data that group similar values
together.
• Standard Deviation classifies data based on how far values deviate from the mean.
Each method has its advantages and limitations. Equal interval may misrepresent skewed data
distributions, while quantile classification can group dissimilar values together. The choice of
method should depend on the nature of the data and the purpose of the map.
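To make the difference between equal interval and quantile classification concrete, the following is a minimal sketch in plain Python. The density values are invented and deliberately skewed, and the function names are illustrative only; GIS software computes these breaks for you and may handle ties and rounding differently.

```python
def equal_interval_breaks(values, k):
    """Upper class limits that split the data range into k equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + width * i for i in range(1, k + 1)]


def quantile_breaks(values, k):
    """Upper class limits that put roughly the same number of values in each class."""
    ordered = sorted(values)
    n = len(ordered)
    # Index of the last value that belongs to each class.
    return [ordered[(n * i) // k - 1] for i in range(1, k + 1)]


def classify(value, breaks):
    """Return the 0-based class index of value, given ascending upper limits."""
    for idx, upper in enumerate(breaks):
        if value <= upper:
            return idx
    return len(breaks) - 1  # guard against floating-point rounding at the maximum


# Hypothetical, strongly skewed values (e.g., population density).
densities = [2, 3, 4, 5, 6, 8, 10, 12, 100]

ei = equal_interval_breaks(densities, 3)
qu = quantile_breaks(densities, 3)
ei_counts = [sum(1 for v in densities if classify(v, ei) == c) for c in range(3)]
qu_counts = [sum(1 for v in densities if classify(v, qu) == c) for c in range(3)]
```

With these values, equal interval places eight of the nine features in the first class and leaves the middle class empty (counts 8, 0, 1), while quantile gives three features per class but puts 10, 12, and 100 in the same class. Both weaknesses described above are visible in this one small example.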
The number of classes used also affects map readability. Too many classes may overwhelm the
viewer, while too few may obscure important differences. Cartographers often use between 4 and 7
classes to maintain visual clarity and interpretability.
4.5 Geocoding: Assigning Spatial Locations
When working with non-spatial datasets—such as tables of schools, hospitals, or survey
responses—spatial locations must be assigned to each record. This process is known as geocoding.
Geocoding translates addresses, place names, or coordinates into geographic features. For instance,
a dataset containing addresses of clinics in a region can be converted into a point layer where each
clinic is located on the map.
Geocoding methods include:
• Coordinate-based geocoding, where latitude and longitude values are used directly
• Address-based geocoding, where street names, postal codes, or towns are matched with
geographic databases
• Administrative boundary joins, where features are linked to polygons like districts or
regions based on names or codes
Accuracy in geocoding is essential. Errors in this stage can lead to misplaced features, which
compromise the integrity of the visualization. Tools such as QGIS, ArcGIS, Google Earth, or web
APIs (e.g., Google Maps Geocoding API) support geocoding tasks.
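Coordinate-based geocoding, the simplest of the methods above, can be sketched with the Python standard library. The clinic names and coordinates below are invented, and the column names (name, latitude, longitude) are assumptions about how the table is laid out. The sketch also applies a basic accuracy check by rejecting missing or out-of-range coordinates.

```python
import csv
import io
import json

# Hypothetical clinic table; names and coordinates are invented for illustration.
CSV_TEXT = """name,latitude,longitude
Clinic A,6.6885,-1.6244
Clinic B,5.6037,-0.1870
Clinic C,95.0,-1.0
Clinic D,,
"""


def rows_to_geojson(csv_text):
    """Convert rows with latitude/longitude columns into GeoJSON points.

    Returns (feature_collection, rejected_names). Rows with missing or
    out-of-range coordinates are rejected rather than silently misplaced.
    """
    features, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except ValueError:  # empty or non-numeric coordinate
            rejected.append(row["name"])
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            rejected.append(row["name"])
            continue
        features.append({
            "type": "Feature",
            # GeoJSON stores positions as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": row["name"]},
        })
    return {"type": "FeatureCollection", "features": features}, rejected


collection, rejected = rows_to_geojson(CSV_TEXT)
geojson_text = json.dumps(collection, indent=2)
```

Swapping latitude and longitude is one of the most common geocoding mistakes, which is why the coordinate order is stated explicitly in the comment. The rejected list should be reviewed by hand rather than discarded.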
4.6 Reprojection and Coordinate System Alignment
One of the most common technical challenges in geovisualization is dealing with coordinate
systems and map projections. Spatial datasets may be created using different coordinate
reference systems (CRS), which define how the curved surface of the earth is translated into flat
map surfaces.
A mismatch in coordinate systems can prevent layers from aligning correctly on the map. For
example, a shapefile projected in UTM Zone 30N (EPSG:32630) will not align with raster data in
WGS 84 geographic coordinates (EPSG:4326) unless both are reprojected into the same CRS.
Reprojection is the process of transforming datasets into a common coordinate system. This can
be done using tools such as “Reproject Layer” in QGIS or “Project” in ArcGIS. It is important to
understand whether a dataset’s CRS is being assigned or transformed: assigning (e.g., “Assign
Projection” in QGIS or “Define Projection” in ArcGIS) only sets the CRS label and leaves the
coordinates untouched, while transforming recalculates the coordinates to match the new CRS.
Assigning is appropriate only when the CRS label is missing or wrong.
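The difference between assigning and transforming a CRS can be illustrated with a small sketch. Web Mercator (EPSG:3857) is used as the target here only because its forward formula is short enough to write by hand; it is an assumption of this example and not one of the systems discussed in this section. A UTM transformation would normally be left to a projection library. The layer structure and point values are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used by Web Mercator


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Transform lon/lat in degrees (EPSG:4326) to x/y in meters (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# A hypothetical layer: a CRS label plus coordinates stored as (lon, lat).
layer = {"crs": "EPSG:4326", "points": [(-1.6244, 6.6885), (0.0, 0.0)]}

# Assigning: only the label changes. The numbers are still degrees, so the
# layer is now mislabeled and would be drawn within a few meters of (0, 0).
assigned = {"crs": "EPSG:3857", "points": list(layer["points"])}

# Transforming: the label and the coordinates change together.
transformed = {
    "crs": "EPSG:3857",
    "points": [wgs84_to_web_mercator(lon, lat) for lon, lat in layer["points"]],
}
```

After the transformation, the first point has coordinates measured in hundreds of thousands of meters, whereas the merely relabeled layer still holds small degree values. This is the typical symptom of a layer that was assigned a CRS when it should have been reprojected.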
Common coordinate systems include:
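• WGS 84 (EPSG:4326) – used by GPS and for storing and exchanging global data; most web maps display this data in Web Mercator (EPSG:3857)
• UTM zones (EPSG:326xx for northern-hemisphere zones on WGS 84, EPSG:327xx for southern) – used for regional and national mapping
• Local projected systems such as Ghana Metre Grid (EPSG:25000) – used in local mapping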
A consistent CRS across all layers is essential to ensure spatial accuracy and visual alignment.
4.7 Merging and Joining Datasets
In many visualization projects, spatial and attribute data come from separate sources and must be
merged. This often involves joining a non-spatial table (e.g., census data) to a spatial dataset (e.g.,
districts shapefile) based on a common field, such as a district code or name.
There are two types of joins:
• Attribute join: adds columns from a table to a spatial layer based on a matching field
• Spatial join: adds attributes based on spatial relationships (e.g., points within polygons)
Joins must be done carefully to avoid data loss or mismatches. For example, if the district names
in the attribute table differ slightly from those in the shapefile (e.g., "Asunafo North" vs. "Asunafo
N."), those records will not match and their joined fields will be left empty unless the names are
corrected first.
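The name-mismatch problem can be reproduced with a small attribute join written in plain Python. This is a sketch under stated assumptions: the population figures are invented, the function names are illustrative, and the alias table is a hypothetical lookup that a person would compile and check by hand.

```python
def normalize(name):
    """Build a comparison key: lowercase, no punctuation, single spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def attribute_join(features, table, feature_key, table_key, aliases=None):
    """Left-join table rows onto features; return (joined, unmatched_names)."""
    aliases = aliases or {}
    lookup = {}
    for row in table:
        key = normalize(row[table_key])
        lookup[aliases.get(key, key)] = row
    joined, unmatched = [], []
    for feat in features:
        row = lookup.get(normalize(feat[feature_key]))
        if row is None:
            unmatched.append(feat[feature_key])
            joined.append(dict(feat))  # feature is kept, but gains no attributes
        else:
            extra = {k: v for k, v in row.items() if k != table_key}
            joined.append({**feat, **extra})
    return joined, unmatched


# Hypothetical layer attributes and census table; the figures are invented.
districts = [{"name": "Asunafo North"}, {"name": "Asunafo South"}]
census = [
    {"district": "Asunafo N.", "population": 1000},
    {"district": "asunafo south", "population": 2000},
]

# Without an alias, "Asunafo N." does not match "Asunafo North".
_, unmatched_before = attribute_join(districts, census, "name", "district")

# An explicit, human-reviewed alias table resolves the abbreviation.
aliases = {"asunafo n": "asunafo north"}
joined, unmatched_after = attribute_join(districts, census, "name", "district", aliases)
```

Differences in case and punctuation are handled automatically by the normalization step, but abbreviations are not. Reporting the unmatched names after every join makes such cases visible. Where available, joining on a stable district code is usually more reliable than joining on names.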
After joining, the enriched dataset can be symbolized and visualized. This process is critical for
thematic mapping, where values like population, literacy rate, or disease incidence must be linked
to spatial features.
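The spatial join mentioned in this section (points within polygons) rests on a point-in-polygon test. The sketch below uses the ray-casting method, which is one common way to implement that test and is not necessarily what any particular GIS package uses internally. The square districts and clinic points are hypothetical, and both layers are assumed to share one CRS.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the polygon ring (list of (x, y))?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal ray running right from the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """Attach the name of the containing polygon (or None) to each point."""
    result = []
    for pt in points:
        match = None
        for poly in polygons:
            if point_in_polygon(pt["x"], pt["y"], poly["ring"]):
                match = poly["name"]
                break
        result.append({**pt, "district": match})
    return result


# Hypothetical square districts and clinic points in one shared CRS.
polygons = [
    {"name": "West", "ring": [(0, 0), (5, 0), (5, 5), (0, 5)]},
    {"name": "East", "ring": [(5, 0), (10, 0), (10, 5), (5, 5)]},
]
points = [
    {"id": 1, "x": 2, "y": 2},
    {"id": 2, "x": 7, "y": 1},
    {"id": 3, "x": 12, "y": 3},
]
joined_points = spatial_join(points, polygons)
```

The third point lies outside both districts and receives None. In practice this often signals a geocoding error or a CRS mismatch, so such points are worth checking. Points lying exactly on a boundary are a special case that this simple sketch does not treat carefully.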
4.8 Format Conversion and Data Compatibility
Geospatial data exists in various formats. Common vector formats include:
• Shapefiles (.shp) – widely used but limited to 10-character field names and one geometry type per file
• GeoJSON (.geojson) – web-friendly, supports interactivity
• KML/KMZ (.kml, .kmz) – used in Google Earth
• GPKG (.gpkg) – modern, efficient, and supports multiple layers in one file
Raster formats include:
• GeoTIFF (.tif) – stores satellite imagery and digital elevation models
• JPEG/PNG – used for background layers or overlays
During data preparation, converting between formats may be necessary. This should be done with
attention to data fidelity, coordinate systems, and attribute compatibility. QGIS and ArcGIS both
support “Save As” and “Export” functions to handle format conversions.
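Attribute compatibility is easiest to see with shapefile field names, which are limited to 10 characters by the underlying dBASE table format. The sketch below checks what happens to longer names before export. The field names are hypothetical and the numeric-suffix scheme is an assumption made for illustration; GIS software applies its own renaming rules, which may differ.

```python
def truncate_field_names(names, limit=10):
    """Shorten field names to `limit` characters, keeping them unique.

    Returns a dict mapping each original name to its shortened form.
    Collisions are resolved by replacing the tail with a numeric suffix.
    """
    mapping, used = {}, set()
    for name in names:
        short = name[:limit]
        counter = 1
        while short in used:
            suffix = str(counter)
            short = name[: limit - len(suffix)] + suffix
            counter += 1
        used.add(short)
        mapping[name] = short
    return mapping


# Hypothetical attribute names from a GeoPackage or CSV source.
fields = ["district", "population_2010", "population_2021", "literacy_rate"]
mapping = truncate_field_names(fields)
```

Here population_2010 and population_2021 would both shorten to "population", so the second becomes "populatio1" and the year is lost from both names. Renaming such fields deliberately before export (for example to pop_2010 and pop_2021) keeps them readable after conversion.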
For interactive web maps, lightweight formats such as GeoJSON or CSV with coordinates are
preferred. Shapefiles are generally unsuitable for web-based visualizations because they consist of
several component files that browsers cannot read directly, so they are usually converted first.
4.9 Handling Large Datasets and Performance Optimization
In geovisualization, especially on web platforms, large datasets can degrade performance. To
optimize, it is common to:
• Simplify geometries to reduce the number of vertices in polygons or lines
• Filter data to show only the most relevant features
• Use tiling systems or caching for web maps
• Convert layers to raster or image tiles for visualization-only purposes
Simplification tools in QGIS or ArcGIS allow users to reduce dataset complexity without losing
essential spatial detail. For example, country borders can be simplified for global maps without
affecting the overall shape.
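Geometry simplification can be illustrated with the Douglas-Peucker algorithm, one widely used approach to reducing vertices. This is a minimal sketch for a single line. The sample coordinates and tolerance values are invented, and the tolerance is expressed in the same units as the coordinates.

```python
import math


def perpendicular_distance(pt, start, end):
    """Distance from pt to the straight line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # start and end coincide
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Douglas-Peucker simplification of a polyline given as (x, y) tuples."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the line joining the two endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        return [points[0], points[-1]]  # all intermediate vertices are dropped
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # avoid duplicating the shared vertex


# Hypothetical line with one nearly redundant vertex at (1, 0.05).
line = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
light = simplify(line, 0.1)
heavy = simplify(line, 3)
```

A small tolerance removes only the nearly collinear vertex, while a large one collapses the line to its two endpoints and loses the peak entirely, which shows why the tolerance must suit the map scale. When adjacent polygons are simplified independently, their shared borders can drift apart, so boundary layers generally need a topology-aware tool.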
4.10 Summary
Data preparation is the foundation upon which all successful geovisualizations are built. From
cleaning and classifying data to aligning coordinate systems and joining datasets, each step plays
a critical role in ensuring that the final visualization is accurate, meaningful, and visually effective.
As students progress to more advanced forms of geovisualization, including interactive, temporal,
and 3D formats, they will increasingly rely on the skills and concepts introduced in this module.
A well-prepared dataset not only enables better analysis but also supports stronger, clearer visual
storytelling.
In the next week, we will explore color theory and map aesthetics, focusing on how to visually
encode data using appropriate color schemes, styles, and layouts to enhance understanding and
accessibility.