DRAFT of October 10, 2002
The Virtual Data Grid:
A New Model and Architecture for Data-Intensive Collaboration
Ian Foster1,2 Jens Vöckler2 Michael Wilde1 Yong Zhao2
1 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
2 Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
{foster,wilde}@mcs.anl.gov, {voeckler,yongzh}@cs.uchicago.edu
Abstract
It is increasingly common to encounter communities engaged in
the collaborative analysis and transformation of large quantities
of data over extended periods of time. We argue that these
communities require a scalable system for managing, tracing,
exploring, and communicating the derivation and analysis of
diverse data objects. Such a system could bring significant
productivity increases by facilitating the discovery,
understanding, assessment, and sharing of both data and
transformation resources, as well as the productive use of
distributed resources for computation, storage, and
collaboration. Thus, we define a model and architecture for a
virtual data grid capable of addressing this requirement. We
define a broadly applicable model of a “typed dataset” as the
unit of derivation tracking, and simple constructs for describing
how datasets are derived from transformations and from other
datasets. We also define mechanisms for integrating with, and
adapting to, existing data management systems and transformation
and analysis tools, as well as Grid mechanisms for distributed
resource management and computation planning. We report on
successful application results obtained with a prototype
implementation called Chimera, involving challenging analyses of
high-energy physics and astronomy data.
1 Introduction
Much interesting research in data systems is concerned,
directly or indirectly, with facilitating the extraction of
insight from large quantities of data. This problem has
motivated innovative techniques for translating da