DISCO: An Internet-based initiative to facilitate data integration for the Neuroscience Information Framework
Luis Marenco (Yale Center for Medical Informatics), Kei-Hoi Cheung (Yale Center for Medical Informatics), Rixin Wang (Yale Center for Medical Informatics), Gordon M. Shepherd (Department of Neurobiology, Yale Univ.), Perry L. Miller (Yale Center for Medical Informatics), NIF Consortium (http://www.neuinfo.org/)
The amount and diversity of neuroscience information available over the Internet have been growing significantly in recent years. Integration of such diverse data sets has become increasingly important for enabling and advancing neuroscience research. Large-scale data integration has been hampered by obstacles including data heterogeneity, insufficient use of semantic data structure and annotation, and lack of standard terminology/vocabulary. In addition, despite the efforts made by search engines such as Google to index vast amounts of information over the Internet, a large proportion of neuroscience data remains inaccessible to computer applications. Current Internet searches are mostly applied to the exposed portions of the literature, Web sites, and databases, which are intended for human readers. Their accuracy is limited because keyword and Boolean search algorithms cannot interpret the context needed to retrieve correct results. While there are many neuroscience resources accessible over the Internet, there is no automatic mechanism in place to facilitate sharing and integration of the various types of neuroscience data in a machine-readable fashion.
The Neuroscience Information Framework (NIF) (http://neuinfo.org) has begun to address the problem of integrating neuroscience data on a global scale. The NIF approach tackles some of the issues that current Internet search technologies leave unaddressed. As part of the NIF data integration effort, we have developed an information discovery and integration approach, DISCO (http://disco.neuinfo.org), to facilitate data sharing by different resources on the Internet. DISCO provides an XML-based format for describing different types of information to be harvested by automated aggregator systems such as NIF.
DISCO consists of a set of extensible format specifications that allow neuroscience resource providers to share their information over the Internet. DISCO currently includes six types of information that can be exposed via the following DISCO services:
a) SiteMap: used to describe high level information about a resource,
b) Terminology: a glossary of the terms used by a resource,
c) Interoperation: a logical description of how to access data provided by a resource for the purpose of interoperation with other resources,
d) Schema: used to describe the database schema of a resource,
e) LinkOut: used by a resource to create data links that extend NCBI Entrez information about publications and data entries (e.g., neurons and genes),
f) News: used by a resource to publicize special issues and activities.
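To make the six service types concrete, the following minimal Python sketch parses a resource descriptor that declares which services a resource exposes. The XML layout shown here (the `resource` and `service` elements, the `type` and `href` attributes, and the example.org URLs) is purely hypothetical and is not the actual DISCO format; only the six service names come from the list above.

```python
import xml.etree.ElementTree as ET

# The six DISCO information types named in the list above.
DISCO_SERVICES = (
    "SiteMap", "Terminology", "Interoperation", "Schema", "LinkOut", "News",
)

# Hypothetical descriptor: element names, attribute names, and URLs are
# illustrative assumptions, not the real DISCO XML schema.
EXAMPLE_DESCRIPTOR = """\
<resource name="ExampleDB">
  <service type="SiteMap" href="http://example.org/disco/sitemap.xml"/>
  <service type="Terminology" href="http://example.org/disco/terms.xml"/>
  <service type="News" href="http://example.org/disco/news.xml"/>
</resource>
"""


def parse_descriptor(xml_text):
    """Return (resource_name, {service_type: url}) for one descriptor."""
    root = ET.fromstring(xml_text)
    services = {}
    for elem in root.findall("service"):
        kind = elem.get("type")
        # Reject anything that is not one of the six known service types.
        if kind not in DISCO_SERVICES:
            raise ValueError(f"unknown DISCO service type: {kind!r}")
        services[kind] = elem.get("href")
    return root.get("name"), services


name, services = parse_descriptor(EXAMPLE_DESCRIPTOR)
print(name, sorted(services))
```

In this sketch a resource may expose any subset of the six services, which mirrors the idea that providers adopt only the services relevant to them.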
To process the DISCO content, NIF has a specialized DISCO system capable of harvesting data from resources that implement DISCO. In addition, we have developed a DISCO Dashboard (http://disco.neuinfo.org) to help track, manage, and interoperate with the shared content of these resources on NIF. Figure 1 shows an architectural overview of DISCO and a portion of the DISCO Dashboard Web interface listing several NIF-registered resources and the various DISCO services utilized by each of those resources. NIF uses the information/data harvested by DISCO as a basis that can be extended to support other technologies, including ontological mappings, horizontal integration, and data update alerts, to help develop an evolvable global scientific portal architecture in support of neuroscience research.
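The dashboard view described above, in which each registered resource is listed with the DISCO services it uses, can be sketched as a simple resource-by-service table. This is an illustrative assumption about how such a listing might be assembled, not the actual Dashboard implementation; the registry contents and the names `ResourceA` and `ResourceB` are hypothetical.

```python
# The six DISCO service types from the text, used as table columns.
DISCO_SERVICES = (
    "SiteMap", "Terminology", "Interoperation", "Schema", "LinkOut", "News",
)

# Hypothetical harvest result: resource name -> set of services it exposes.
registry = {
    "ResourceB": {"SiteMap", "News"},
    "ResourceA": {"SiteMap", "Terminology", "LinkOut"},
}


def dashboard_rows(registry):
    """Return one (resource, flags) row per resource, sorted by name.

    flags is a tuple of booleans aligned with DISCO_SERVICES.
    """
    rows = []
    for resource in sorted(registry):
        used = registry[resource]
        rows.append((resource, tuple(svc in used for svc in DISCO_SERVICES)))
    return rows


def render(rows):
    """Format the rows as a plain-text table with one column per service."""
    width = max(len(s) for s in DISCO_SERVICES)
    lines = ["Resource".ljust(12) + " ".join(s.ljust(width) for s in DISCO_SERVICES)]
    for resource, flags in rows:
        cells = " ".join(("yes" if f else "-").ljust(width) for f in flags)
        lines.append(resource.ljust(12) + cells)
    return "\n".join(lines)


rows = dashboard_rows(registry)
print(render(rows))
```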