Experiences on Processing Spatial Data with MapReduce*
Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe
Florida International University
School of Computing and Information Sciences
11200 SW 8th St, Miami, FL 33199
{acary001,sunz,vagelis,rishen}@cis.fiu.edu
Abstract. The amount of information in spatial databases is growing as more
data is made available. Spatial databases mainly store two types of data: raster
data (satellite/aerial digital images), and vector data (points, lines, polygons).
The complexity and nature of spatial databases make them ideal for applying
parallel processing. MapReduce is an emerging massively parallel computing
model, proposed by Google. In this work, we present our experiences in
applying the MapReduce model to solve two important spatial problems: (a)
bulk-construction of R-Trees and (b) aerial image quality computation, which
involve vector and raster data, respectively. We present our results on the
scalability of MapReduce, and the effect of parallelism on the quality of the
results. Our algorithms were executed on a Google & IBM cluster, which
became available to us through an NSF-supported program. The cluster
supports the Hadoop framework, an open-source implementation of
MapReduce. Our results confirm the excellent scalability of the MapReduce
framework in processing parallelizable problems.
1 Introduction
Geographic Information Systems (GIS) deal with large amounts of complex
spatial data, mainly of two categories: raster data (satellite/aerial digital images), and
vector data (points, lines, polygons). This type of data is periodically generated via
specialized sensors, satellites or aircraft-mounted cameras (sampling geographical
regions into digital images), or GPS devices (generating geo-location information).
GIS have to efficiently manage repositories of spatial data for various
purposes, such as spatial searches and imagery processing. Due to the large size of
spatial repositories and the complexity of the app