Distributed information retrieval is quickly becoming a critical component of how research is done. Until the explosive development of the World Wide Web, indexing and retrieval could be practically accomplished only on co-administered, and typically co-located, data collections. Now, with the availability of storage and network bandwidth, it is possible to link widely distributed, heterogeneous collections of scientific data together to form meta-collections of unprecedented scope and expressive power. Unfortunately, existing Web protocols and browsers do not provide sufficient infrastructure to fully realize this promise.
To a large extent, this is because scientific data has characteristics which make it a particularly tough case for distributed information exchange: it is large, complex, and stored in specialized formats. But it is also because there are architectural issues in distributed information retrieval which remain to be addressed for all kinds of data.
The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted. This capability would form the basis of a set of focused and powerful search tools which would be capable of guiding users and agents through an information space to find the desired data sets. The data sets could then be retrieved and passed to specialized browsers or analysis tools for further investigation.
Doing this requires several key architectural components. One is a consistent set of metadata semantics which is powerful enough to distinguish the data sets in a collection, yet generalizable enough to have a coherent and predictable relationship to the metadata schemas of other collections. These metadata semantics will need to be developed by the stewards of data collections in collaboration with each other, lest there be semantic islands of metadata with no automated way to link between them, and thus no way to search across collections. A second key architectural component is a standard information retrieval protocol, or set of interoperable information retrieval protocols, which supports the metadata semantics. This would enable lightweight search tools to use a single set of semantics to search many different data collections.
Scientific data sets are some of the most complex and challenging digital artifacts that exist. For one thing, they are routinely several orders of magnitude larger than other kinds of data sets. Scientific instruments produce data at increasingly high resolution, and the cost of storing that data continues to drop exponentially (outpacing the falling price of network bandwidth). As the cost of scientific instruments and analysis tools drops, the number of scientific data sets in digital form is growing explosively. Additionally, many of these data sets are being delivered more and more rapidly to scientists; increasingly, scientific data is going to be collected and archived in real time. These characteristics place unusual burdens on retrieval systems. First, it is not practical to build centralized repositories of scientific data, because such repositories would be too expensive to set up and maintain. Second, it is not practical to deliver entire data sets over the network, because even moderate traffic in these data sets would quickly saturate all but the widest network pipes.
Aside from scale, scientific data is also much more complex than the "document-like" data found on the Web. There are many scientific data formats, many of them non-standard. Scientific data sets are also processed through elaborate analysis steps, resulting in reduced data sets which accentuate significant features of the data. Metadata about these analysis steps is often not kept with the data, because the scientific data formats rarely accommodate this kind of metadata in any useful or standard way.
Distributed information retrieval depends on the ability to describe data sets using a consistent set of semantics, and to express search requests and results in a standard, interoperable protocol. The semantics required to distinguish scientific data sets from one another in a collection are almost invariably discipline-specific, and usually sub-discipline-specific. For instance, radio astronomers who make observations at one wavelength may use a different set of distinguishing semantics than astronomers who observe at other wavelengths. To the extent that the semantics of metadata differ between collections, it is not possible to do cross-disciplinary searching. But to the extent that semantics are equivalent, or can be considered equivalent for a certain kind of query, a distributed IR system needs to define high-level metadata semantics which can be easily mapped from one discipline to another. These high-level semantics constitute the context in which more specific, and perhaps unique, metadata semantics are to be automatically understood. For instance, a database of cancer literature may contain the semantics necessary to distinguish thousands of diseases, but would also contain a parent concept representing "disease" which could be mapped to the relatively impoverished concept of "disease" that you might find in, say, an encyclopedia. A good example of this kind of interoperability between classification schemes is the Unified Medical Language System, developed by the National Library of Medicine. It is our opinion that data providers from particular sub-disciplines should develop the appropriate metadata semantics for their sub-discipline, which they can "plug in" to the higher-level metadata semantics. To work, this would need only minimal central authority; this is analogous to the way Internet domain names are administered now.
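The idea of mapping a specific term to a shared parent concept can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the term names, the child-to-parent table, and the `generalize` helper are hypothetical and are not drawn from the Unified Medical Language System or from any real vocabulary.

```python
# Hypothetical child -> parent links within one discipline's vocabulary.
PARENT = {
    "melanoma": "skin cancer",
    "skin cancer": "cancer",
    "cancer": "disease",
}

# Hypothetical, much coarser vocabulary of a general-purpose collection.
ENCYCLOPEDIA_TERMS = {"disease", "planet"}


def generalize(term, shared_vocabulary, parent=PARENT):
    """Walk up the hierarchy until a term known to the other collection is found.

    Returns the first ancestor (or the term itself) that appears in
    shared_vocabulary, or None if no such concept exists.
    """
    seen = set()  # guards against accidental cycles in the parent table
    while term is not None and term not in seen:
        if term in shared_vocabulary:
            return term
        seen.add(term)
        term = parent.get(term)
    return None


mapped = generalize("melanoma", ENCYCLOPEDIA_TERMS)
unmapped = generalize("quasar", ENCYCLOPEDIA_TERMS)
```

Under these assumptions, a query about "melanoma" degrades to a query about "disease" when sent to the coarser collection, while a term with no shared ancestor is reported as unmappable rather than silently dropped.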
The second key component of a distributed IR system is a standard information retrieval protocol, or a set of interoperable IR protocols. These protocols must be non-proprietary or be accessible through non-proprietary protocol adapters such as proxies. In order to interoperate, IR protocols must adhere to a consistent model of what a data collection is. This model should ideally be minimal, or at least should degrade gracefully to a minimal model of a data collection. Such a minimal model might be: "a data collection is an unordered set of opaque blobs which can be matched against a particular set of metadata semantics" (this is a very rough paraphrase of Z39.50's model). The behavior should also be minimal; for instance, "retrieve records which match this query". This raises some important issues for scientific data, which because of its size and complexity must be remotely analyzed (rather than delivered intact to the client) using presumably highly complex and specialized protocols. In this scenario, the IR protocol must be able to hand off the results of a search to a subsystem which is capable of this kind of interactivity. Doing this consistently across distributed, heterogeneous collections requires a metadata architecture similar to the one described above for identifying data sets, except that this kind of metadata would be used to control a broker which would connect the user and the data to the required remote analysis and visualization pipeline. We call this having a "conversation with the data".
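The minimal collection model quoted above can be sketched in a few lines. This is an illustration under stated assumptions, not Z39.50 itself: the `Record` and `Collection` names, the field names, and the exact-match rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    blob: bytes      # opaque payload; the protocol never looks inside it
    metadata: tuple  # (field, value) pairs drawn from the agreed semantics


class Collection:
    """An unordered set of opaque blobs matched only through their metadata."""

    def __init__(self, records):
        self._records = frozenset(records)

    def search(self, query):
        """Retrieve records which match this query.

        query is a dict of field -> value; a record matches when it carries
        every requested pair. An empty query matches every record.
        """
        wanted = set(query.items())
        return {r for r in self._records if wanted <= set(r.metadata)}


# Hypothetical sample records, for illustration only.
collection = Collection([
    Record(b"\x00\x01", (("instrument", "radio"), ("object", "M31"))),
    Record(b"\x02\x03", (("instrument", "radio"), ("object", "M87"))),
    Record(b"\x04\x05", (("instrument", "optical"), ("object", "M31"))),
])

radio_hits = collection.search({"instrument": "radio"})
```

Because the model says nothing about what a blob contains, the result of `search` is exactly the kind of value that could be handed off to a separate subsystem for remote analysis, as the paragraph above describes.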
Emerge has built components which demonstrate how to build a distributed IR system for scientific data. Our key technology is an XML-based translation engine which can perform metadata mapping and query translation. This is used in the Gazebo search gateway to translate a stateful, XML-based IR protocol into Z39.50, performing syntax translation and metadata mapping under the control of an XML template-based configuration. It is multithreaded and can execute many queries simultaneously, returning the results asynchronously. Gazelle is a Z39.50 server which can map from Z39.50 queries into arbitrary query syntaxes, also built around an XML template-based translation engine. Z39.50 is translated into calls to a simple search API which can be implemented to target arbitrary databases. Our last component is a demonstration GUI which can construct queries for Gazebo and can be used to browse the results. Gazebo was developed for the National Cancer Institute as a browser for cancer literature and other cancer-related data, but has also been successfully applied to collections of astronomy images such as the Astronomy Digital Image Library.
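To make the notion of table-driven query translation concrete, here is a minimal sketch that turns a small XML query into a Z39.50 query written in Prefix Query Format (PQF) using Bib-1 "use" attribute numbers. It is not the Emerge translation engine: the XML query shape, the `FIELD_TO_USE_ATTR` table, and the `xml_query_to_pqf` function are assumptions made for illustration, and a plain dictionary stands in for the XML template-based configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from gateway field names to Bib-1 "use" attributes.
FIELD_TO_USE_ATTR = {"title": 4, "author": 1003}


def xml_query_to_pqf(xml_text, field_map=FIELD_TO_USE_ATTR):
    """Translate <query><term field="...">value</term>...</query> into PQF.

    All terms are combined with boolean AND. Quoting of special characters
    inside term values is not handled in this sketch.
    """
    root = ET.fromstring(xml_text)
    clauses = []
    for term in root.findall("term"):
        field = term.get("field")
        if field not in field_map:
            raise ValueError(f"no mapping for field {field!r}")
        value = (term.text or "").strip()
        clauses.append(f'@attr 1={field_map[field]} "{value}"')
    if not clauses:
        raise ValueError("query contains no terms")
    # PQF is prefix notation, so n clauses need n - 1 leading @and operators.
    return " ".join(["@and"] * (len(clauses) - 1) + clauses)


example = (
    '<query>'
    '<term field="title">galaxy clusters</term>'
    '<term field="author">Smith</term>'
    '</query>'
)
pqf = xml_query_to_pqf(example)
```

Keeping the field mapping in data rather than in code is the design point being illustrated: pointing the same gateway at a collection with different metadata semantics would mean supplying a different table, not rewriting the translator.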