
Copyright © 1997 - 2025 Science Tools Corporation. All rights reserved.

Please Note:

The following is provided for historical purposes only... Enjoy.

May 12, 1999

A Proposal to the ESIP Federation

for review and consideration by

The Federation Interoperability Group

and its subcommittees

including

The SWIL Tiger Team

The following proposal and its addendum(s) describe an implementation using the BigSur Earth Science System® for ESIP Federation interoperability purposes. The proposed system is capable of performing earth science beyond that required to be considered a Software Interoperability Layer (SWIL), and the beginning of the proposal points out the benefits of going beyond such requirements. Because our proposed system is capable of functioning merely as a software interoperability layer, and because there is presently an emphasis upon a SWIL solution to interoperability, the system we propose should be considered for this purpose. We have in our addendum addressed the specific proposed criteria for a SWIL, as documented on the web page previously located at http://www.icess.ucsb.edu/~frew/ifederation/SWIL/, now archived in our SWIL files.

On Making a Federation

Earth Science Information Partners

April, 1999.

The challenge of assembling a working federation out of established researchers in related but distinct disciplines is overcoming the cacophony of voices and individualism which inherently exists. With the ubiquity of computing technology in research work today, it would seem natural to use computers as a means to bring harmony, but in scientific computing we find a plethora of solutions to what appear to be common problems. Each researcher has their own favorite data types and data-type hierarchies, data manipulation and visualization tools, system architectures and research paradigms. Given tight budgets, it is hard for individual researchers to see the benefits to them which might offset the cost of changing their ways. And, often enough, researchers disagree with the technical arguments of their peers. In sum, Earth system researchers are well-educated, intelligent people who have chosen to devote their lives to a discipline other than computer science; the information systems they create are tailored to specific problems and, in and of themselves, are not designed to be adapted to solve the problems of others, much less the more general problem of unifying a whole community. Yet this is precisely the need.

To address this challenge and create a computer based information system which binds disparate Earth system researchers into a cohesive Federation, the system must embrace the fundamental diversity as a feature. As an example, a user of such a system who is interested in some type of data will wish to find:

which sites and which specific computer systems have collections of that type
what scientific functions and processes operate against the type
to what logical groupings that data type may belong
what types of data are derived from that type and what processes perform that derivation
which tools (software packages) are available to manipulate or visualize such data and which are the favorites of specific researchers
what white-papers and web sites have been written about that data type
who are the individuals responsible for any and all of the above

The answers to these questions, and many others, are stored in meta-data, without which the entire enterprise is not possible. Yet, most systems cannot answer these questions because they were not designed to unify a community and too many details are left as assumptions and presumptions. The key is to have the right meta-data, organized using the right abstractions and using the best data storage technologies presently available.

We propose a meta-data based system which provides the infrastructure necessary to permit research systems to be connected using the latest technologies available in commercial systems, and to facilitate collaborative interaction and access by the lay public.

A Proven System with a Community-Orientation and an Evolutionary Strategy: The critical need to provide a Community-Oriented approach was envisaged by the UC Berkeley led Sequoia 2000 project team as an alternative to the Hughes EOS-DIS strategy. Emphasis was given to creating a high-performance earth-science system, fostering collaborative efforts with distributed processing, sophisticated searching, and strong scientific-defensibility features, with an initial focus in geology, hydrology, oceanography, and climate modeling. A prototype was built by the BigSur project team, and additional diverse data-sets were included to address end-user needs such as those of the State of California’s Resources Agency. The technology was demonstrated as early as the spring of ’95. It was commercialized in 1997, has since undergone further development by Science Tools Corporation, and is performing production work at various research centers today. We call this the BigSur Earth Science System (or just BigSur). The most recent Evolution pushes this paradigm even further, adding the concepts of research sites (not just individual computers) and the publishing of materials between sites.

Adaptive-Orientation: An Adaptive-Orientation converts diversity into an asset to be exploited rather than a hindrance to cohesion by learning the features of each researcher’s existing system in a direct way. The goal is not to replace existing systems, but to harness them as tools. For the individual researcher, an Adaptive-Orientation permits the system to learn the ways of the researcher rather than force the researcher to learn the ways of the system. And the system does not chase every new advancement in data access methods, such as new HDF formats, but rather it "learns" (or absorbs) those favored by the researcher as they are introduced. Because of our system’s Adaptive-Orientation, it will dovetail nicely with the DODS system and the proposed Mercury system, and will most likely live in harmony and cooperation with any other systems in the community.

Progressive-Utilization: Progressive-Utilization permits users to pick and choose features as desired. Progressive-Utilization extends flexibility by focusing on what the researcher finds appropriate, while minimizing requisite burdens. While the BigSur Earth Science System was designed to manage Earth-system data from satellite to end-user desktop, Progressive-Utilization permits it to act as merely an electronic notebook, or just a distributed processing system, if that’s all that is desired.

Database-Centrism: Of primary interest to those wishing to create a computer based federated research information system are issues of meta-data management. Without meta-data, sharing data is not possible; to facilitate fast, reliable, easy and ubiquitous meta-data access, a competent database system is a requisite core technology. The most capable system must take advantage of the fastest, most flexible and modern data management technology available, with as little demand for programming services as possible and a maximum ability to accept data-requests from applications of all varieties. Therefore, the system should be database-centric. BigSur is database-centric and all meta-data is accessible through SQL to virtually all application programming environments.

Meta-Data Management: Researchers create meta-data as they perform their work, such as the type-hierarchies and functional processes in which research methods and paradigms are implemented. These meta-data fall into three categories: those which are "static", such as who researchers are, those which relate directly to individual data-objects, and those which are embedded as assumptions and presumptions into the information systems and application programs used. Exposing these data to others is vital. Because of the plethora of research data-types, tools, et cetera, within the community and the changing nature of these features over time, the system must be flexible, Community and Adaptively-Oriented and capable of learning. Thus, BigSur does not try to embed knowledge of these features, but rather learns (or absorbs) those favored by the researcher, and is free to pick up new advancements, such as HDF5. What are assumed relationships in other systems are explicitly stated in BigSur. (For example the type-hierarchy created when a "REGIS Aerial Photograph" is stored as a GIF file is not assumed or implied and is instead taught to the system by making an entry associating the two object types. Thus an object may be of type "REGIS Aerial Photograph" and of type GIF, both at the same time.)
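The explicit type association described above can be sketched with a small relational example. This is only an illustration under stated assumptions: the table and column names (`object_type`, `sci_object`, `object_is_type`) are hypothetical and are not BigSur's actual schema, and Python's built-in SQLite module stands in for the ORDBMS.

```python
import sqlite3


def build_catalog():
    """Create an in-memory catalog using a hypothetical three-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE object_type (type_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE sci_object  (object_id INTEGER PRIMARY KEY, label TEXT);
        -- The relationship is stated explicitly: one row per (object, type) pair.
        CREATE TABLE object_is_type (
            object_id INTEGER REFERENCES sci_object(object_id),
            type_id   INTEGER REFERENCES object_type(type_id),
            PRIMARY KEY (object_id, type_id)
        );
    """)
    db.executemany("INSERT INTO object_type VALUES (?, ?)",
                   [(1, "REGIS Aerial Photograph"), (2, "GIF")])
    db.execute("INSERT INTO sci_object VALUES (1, 'photo-0001')")
    # "Teach" the system that the same object is of both types at once.
    db.executemany("INSERT INTO object_is_type VALUES (?, ?)", [(1, 1), (1, 2)])
    return db


def types_of(db, object_id):
    """Return every type name recorded for an object, sorted by name."""
    rows = db.execute(
        """SELECT t.name
           FROM object_is_type AS a
           JOIN object_type AS t ON t.type_id = a.type_id
           WHERE a.object_id = ?
           ORDER BY t.name""",
        (object_id,))
    return [name for (name,) in rows]
```

Because the association lives in its own table, adding a newly favored format later is a matter of inserting rows rather than changing code.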

By applying our Adaptive-Orientation to all aspects of meta-data, and by not locking the researcher into predefined or rigid hierarchies, the system has intimate knowledge and a record of almost all aspects of scientific inquiry in its purview. Notable among these are lineage details about each individual scientific object, all the types and collections to which each belongs, the processes that created them and the parental input objects to those processes. Site-specific information is fully integrated with both data-objects and scientific processes, so object lineage can be maintained when a researcher uses as input data from a collaborator at another site. This permits strong scientific defensibility of conclusions reached through collaborative efforts throughout the community, regardless of where particular data-sets may be stored or meta-data managed.
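To make the lineage idea concrete, the following is a minimal sketch of how parental inputs might be traced back through the processes that created an object, including across sites. The schema (`sci_object`, `process_run`, `run_input`), the sample rows, and the site names are all hypothetical, chosen only for illustration, and SQLite stands in for the ORDBMS.

```python
import sqlite3


def build_lineage_db():
    """Create a hypothetical lineage schema with two process runs."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sci_object (object_id INTEGER PRIMARY KEY, label TEXT, site TEXT);
        CREATE TABLE process_run (
            run_id    INTEGER PRIMARY KEY,
            process   TEXT,
            output_id INTEGER REFERENCES sci_object(object_id));
        CREATE TABLE run_input (
            run_id   INTEGER REFERENCES process_run(run_id),
            input_id INTEGER REFERENCES sci_object(object_id));
    """)
    db.executemany("INSERT INTO sci_object VALUES (?, ?, ?)", [
        (1, "raw-scene", "site-A"),
        (2, "calibrated-scene", "site-A"),
        (3, "derived-index-map", "site-B"),  # made at another site
    ])
    db.executemany("INSERT INTO process_run VALUES (?, ?, ?)", [
        (10, "calibrate", 2),
        (11, "compute-index", 3),
    ])
    db.executemany("INSERT INTO run_input VALUES (?, ?)", [(10, 1), (11, 2)])
    return db


def ancestors(db, object_id):
    """Return (label, site, depth) for every object the given one derives from."""
    rows = db.execute("""
        WITH RECURSIVE lineage(object_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT i.input_id, l.depth + 1
            FROM lineage AS l
            JOIN process_run AS r ON r.output_id = l.object_id
            JOIN run_input   AS i ON i.run_id = r.run_id
        )
        SELECT o.label, o.site, l.depth
        FROM lineage AS l
        JOIN sci_object AS o ON o.object_id = l.object_id
        WHERE l.depth > 0
        ORDER BY l.depth
    """, (object_id,))
    return rows.fetchall()
```

Here `ancestors(db, 3)` walks from the derived product at one site back to the raw input held at another, which is the kind of record that supports scientific defensibility.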

BigSur can bring together and manage just the right kind of meta-data and make a reality the promise of a cohesive Federation information system.

A High-Performance, Distributed, Scaleable Architecture: Site-independence of data-objects from meta-data provides a natural ability to cooperate in a distributed environment, and provide for sharing of workload between systems. BigSur’s Distributed Processing System permits Scientific functions to be performed on any system in the network, as desired, with both process-level and system-level controls for operations staff to manage workload. While BigSur can be implemented on any ORDBMS, use of the Informix Universal Server permits the use of their R-Tree index structure, providing the fastest Geo-Spatial searches presently available in the commercial marketplace. And use of modern ORDBMSs brings the benefit of robust tools for managing a site.

Putting it all together

The system can be installed and put to immediate use. Some software needs to be written, primarily a web-based application tailored to browse the Federation’s collections, and the various means by which existing data will be made available to BigSur. To facilitate collection of meta-data, each ESIP must designate a technical liaison to assist others in obtaining their meta-data for collective purposes in an ongoing way.

Obtaining meta-data:

The meta-data for a site can be managed by another site – doing so is rather trivial as long as a strategy is implemented to keep the meta-data up to date – the location of the meta-data is irrelevant. In this way a small, resource-poor site can be "hosted" elsewhere.

Configuration options for a Federation information system include the use of a single "master" repository, distributed databases at each (or most) site(s), or a combination of these. The present system is capable of any of these choices and even includes a "publishing" mechanism by which meta-data managed at a remote site can be kept current automatically.

For those that wish to use BigSur’s Distributed Processing System, the system can automate meta-data update and perform publishing to a central ESIP site, if one is used. Process-scripts must be written which manage scientific processes, update the meta-data, and perform publishing between sites. These scripts may be written by our staff, by representatives at each site, or by the collaborative effort of both. There are existing scripts and templates to illustrate the way, and there is a Java language Application Programming Interface library available.

If researchers prefer to continue to use their existing systems, or use BigSur but manually run their scientific functions, another means must be devised to insert meta-data for any new objects created. Such meta-data may be inserted manually. An addition to an existing system could automate insertion of meta-data into BigSur. These new objects could be observed and have their meta-data populated by an as yet unwritten program that updates a central database much as web search engines utilize a ‘web-crawler’ to discover otherwise unknown web pages. Or, similar to the web-crawler approach, a "gateway" could be written to access meta-data stored in another information management system. The key is to have a mechanism which captures the maximum information available.

An ESIP-specific Application: An application should be written oriented primarily toward non-Federation members such as non-Federation researchers, "K-12", the public at large, and the occasional Senator. Because the underlying tool is a modern database engine, simple SQL ad-hoc queries can suffice for many Federation users, and a host of application development tools are available for those who wish to do their own interesting things. A general-purpose, public-at-large application with a web interface can easily be written because of the many tools available from the commercial market place for SQL compliant databases. In addition, the BigSur Earth Science System includes a programming tool-kit in the form of a Java API.

We propose the installation of a Federation database to be managed by the ESIP Federation, perhaps at the Federation’s Web Page provider’s site. Each member site will be represented in that database. We can then populate it nearly immediately with a basic outline of what each ESIP member does. Tangible results can therefore be obtained very quickly during system implementation – nearly immediately. For each ESIP member, a long-term method of ingesting meta-data into the database must then be devised, and this will vary with each site as the present means of storing this data also varies. The exact scale of this work is not known at this time, but at least a moderate level of meta-data should be fairly easily ingested. Each site will have a base minimum of data populated into the system, and as much of a permanent meta-data ingestion strategy as can "easily and quickly" be worked out for each site will be implemented as fully as possible. Finally, an application will be authored to permit browsing of the assembled meta-data with an ESIP Federation focus as described above.

The Federation Web Site has not yet been codified nor the systems for it secured, and no one can yet realistically state what kinds of workload demands will be made of Federation-unifying systems. Authoring of an application for ESIP browsing is surely a never ending development effort, and the man-hours required to evaluate existing meta-data storage techniques and implement a collection strategy for each is clearly not small. Available funding is limited toward these efforts and realistically we know that there is simply not enough money to fund all that one might wish to do.

We therefore propose the modest sum of $X per year to fund the above outlined activities. We believe this to be sufficient to obtain fundamental goals. Should a Federation Web-Site provider be willing to participate in our efforts, we may move forward at an accelerated rate. And with designated liaisons at each ESIP member’s site, we may move forward at an even greater rate.

Richard Troy,
Chief Scientist

Science Tools Corporation,

1345 Wicklow Lane, Ormond Beach, FL 32174

ScienceTools.com, 386-868-3846

On Making a Federation

Earth Science Information Partners

A Proposal for Using the "BigSur" System

Addendum 'A'

Evaluation

using the proposed

Interoperability Criteria and Requirement

This document is in direct response to the criteria proposed in the "System Concept Evaluation Criteria" presentation, which was previously located at http://www.icess.ucsb.edu/~frew/ifederation/SWIL/ and is now archived in our SWIL files. We have followed the general outline offered there. In responding, we were asked to describe how the proposed system addresses the criteria both qualitatively and quantitatively and what our view of any minimum levels of compliance is.

For the purposes of our proposal, our focus is the use of the BigSur Earth Science System® as a "Catalog based Interoperability system" focused on management and browsing of the meta-data which describes the holdings of ESIP Federation members by various user groups via the Internet. The further capabilities of this system's use for more robust interoperability and other scientific purposes represent an opportunity for further benefit of acceptance of our proposal, beyond the immediate goals at hand. It is worth bearing in mind that "catalog interoperability" is truly a minimum beginning to interoperability, and perhaps is better described as a "common results cataloging and browsing effort." Our system affords a valuable opportunity for robust interoperability because of the rich suite of meta-data easily accessible within. With its ability to intricately manage the myriad web of relationships between objects, scientific functions, and the environments that create them, and its ability to be progressively utilized for performing science, researchers may directly associate the meta-data describing their work with that of others found through the system. As more lineage meta-data is known about the creation of derived data products, scientific defensibility is enhanced for ESIP Members.

BigSur conforms to the FGDC Geo-spatial meta-data standard, and has the structure necessary to manage all meta-data common to all ESIP data. By collecting meta-data for each ESIP Member's holdings, our solution permits individual queries to be made literally of the whole of ESIP Members' holdings. Thus, the system makes the Federation's disparate members appear as one entity to browsers. Because access will be made through the Internet, the whole of the public will be afforded access. Should an ESIP member also put its data holdings on Internet-accessible storage media, our solution offers direct access to the fundamental means of data-access that makes discovery via meta-data meaningful.

While more powerful mechanisms for populating such a database are also a part of our solution, a fundamental foundation for doing so is the Global Change Master Directory. Beyond using the GCMD to populate our database, we will work with other ESIP Members to take advantage of available alternative mechanisms to gather more robust meta-data.

Overall criteria

Allow single, multiple, or composite solutions: Our proposed solution supports other interoperability mechanisms easily because BigSur already incorporates a mechanism for managing the meta-data about tools existing at other sites. New and novel hand-offs between data management tools are greatly aided by such meta-data. Our proposed system goes well beyond this criterion.
Multiple must be equivalent - All the ESIPs, all the meta-data: Because our system will manage its own copy of all of the meta-data, and because the GCMD provides a baseline for acquiring this data from all ESIP members, our proposal meets this criterion. We also propose to do more, as funds allow, to help all ESIPs provide as much meta-data as is practically possible.
Composite; should be seamless and "functionally equivalent": Please see the bullet item above which addresses this.
Security and access control; Expose subsets of catalog information [to public]: Our system manages meta-data identifying individual users and has a flexible security scheme. (It also manages the meta-data for multiple access methods into other {possibly non-BigSur} sites for varying levels of access between systems for robust composite-solutions.) It will be very easy to control access as desired.
Note that at this time, it's not clear that security features are necessary as ESIP members can control what information they publish in this system. Therefore, there will be a natural security mechanism created by ESIP members as they decide what to make public.
Use of, or compliance with, any relevant standards: Applicable standards fall into a few categories. For meta-data storage, our system presently complies with the FGDC Geo-Spatial meta-data standard, and the similar SAIF standard. We will also ensure that it does in fact manage all data in accordance with the GCMD. This past spring (1999) we began the addition of Z39.50 and XML servers to our system, developed by our partner Dr. Konstantinos Kalpakis (USRA ESIP). As a database-centric system, it naturally complies with SQL92 (the current Structured Query Language standard), and our favored database vendor also provides us with ODBC and JDBC for remote (internet and intranet) application connections in Java, C, Basic, et cetera.
Discovery and description of services as well as data products: We propose to directly tie in to the ESIPFED.ORG web site for user discovery and description of our service. Once a user has discovered our service they will then be given the opportunity to search the holdings in our database, as well as get help regarding the services our application(s) provide. For scientific holdings, ESIP Members must provide descriptions they wish presented, though our proposal also includes an effort to collect this information on their behalf. Within the database, there will be both high level and detailed descriptions of data products and any services available at ESIP Member sites, including descriptions of other search tools and value-added features which may be provided by others for obtaining, browsing, visualizing or otherwise handling scientific data. Additionally, our system associates scientific objects and processes with links (primarily URLs) to other relevant supplemental information as provided by ESIP Members or as otherwise available and known to us; for example, to document chosen vocabularies.
Risks
- Maturity: The BigSur Earth Science System® is now over two years old, and will have been in service at an ESIP Member site (LaRC) for two years this September. The database design is mature and solid, and is continually being improved. In our proposal, what's new is an ESIP Federation-specific application which will present public ESIP Member assets to Internet users. New tools will be written to populate and maintain the database from ESIP member holdings. The foundational technologies are all mature.
  
  Further reducing risk is the great compatibility between our system and that of others. Additionally, two ESIP sites already use our system, LaRC, and OceanESIP. The ESSW ESIP (Santa Barbara) has an architecture which appears to be based on early University BigSur research. We anticipate relatively easy interchange of data with other systems used by the Federation.
- Acceptance by users and by providers: Of the two ESIP Members who use this system already, both endorse this proposal and state that they prefer this approach. The Deputy Director of the LaRC DAAC, Richard McGinnis, when asked about this system replied, "It does everything [the designer] Richard said it would do." Wisdom reminds us "the best predictor of the future is the past."
- Support: Support must exist at many different levels: the site where the host hardware resides, the computer(s) and operating system(s), the database product, network interconnectivity, the application(s), and of course BigSur itself. Each must have a support plan and be funded appropriately. Our proposal requests that the host computer(s) for the database system and associated infrastructure initially be managed, in the first funding cycle, by the site which also manages the 'ESIPFED.org' web site. Furthermore, commitments for the first year of host resources, the above-mentioned support, and all necessary database administration and application services have been made by both JPL's OceanESIP project, and Berkeley Earth Science Tools Corporation.
- Technological change:
  - Continuing support for obsolete technologies: A key observation our architecture reflects is that in doing Earth Science, most researchers already have existing systems. Therefore, BigSur has an Adaptive-Orientation that allows it to learn and adapt to an existing environment's methods, structures, and paradigms. For example, our system has embedded within it an ability to execute any existing code which can be initiated from a command line, and it can store source code and compile it on the fly, if desired. This technical ability is illustrative of our architectural perspective.
  - Migration to newer technologies: We have anticipated new technologies and have taken great care to avoid embedding into our work anything which might become stale, such as access methods. This was precisely the impetus for the decision to develop components which permit the system to recognize and call (at an executable level) other tool sets, such as visualization or data access tools. Thus, technologies that change with time are easily supplanted by new ones. The bedrock of our system, the Relational Database System, is here to stay for the foreseeable future.
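The idea of recognizing and calling other tool sets at an executable level can be illustrated with a short sketch. Everything here is an assumption made for illustration: the registry layout, the `launch` function, and the `demo-type` entry are hypothetical and do not describe BigSur's actual mechanism. The "tool" is simply the Python interpreter echoing its argument so that the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical registry mapping a data type to the command line that handles it.
# "{path}" marks where the object's location is substituted. In a catalog, rows
# like this would be meta-data, so a tool can be swapped without code changes.
TOOLS = {
    "demo-type": [sys.executable, "-c",
                  "import sys; print('viewing', sys.argv[1])", "{path}"],
}


def launch(tool_registry, data_type, path):
    """Run the registered command-line tool for a data type and return its output."""
    template = tool_registry[data_type]
    # Substitute only the exact placeholder token; other arguments pass through.
    argv = [path if part == "{path}" else part for part in template]
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout.strip()
```

Replacing an outdated viewer or access method then amounts to updating one registry entry.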

Accessing the Federation via a BigSur Installation – The User's View

When reviewing the abilities of our proposed system, it may be helpful to have a perspective on user access to the system. Access to the ESIP BigSur installation can be accomplished in many ways:

Web browser – connection to the ESIP web site will direct users to a tree of HTML based web pages. This is the most fundamental method of access.
ESIP-specific, Java based application – This application will be written as a part of this proposal. It will provide search and retrieval operations, aid users in choosing selection criteria, and present results to users.
Java-based API – the Applications Programming Interface we use will also be available for users who wish to write their own code. These users may also include researchers who wish to do more than the typical browsing user, for example, for data-mining purposes.
Database vendor tools – All database vendors provide some tools and these shall be made available. Additionally, many companies have written database tools of a generic nature which can be used against virtually any database vendor's database engine, and with virtually any database design. Finding these will be the burden of interested users.
Custom Applications – Custom applications have no restrictions on the database access methods they employ.
Note that the methods used to populate the environment are just user applications from the perspective of the database engine.

Catalog Interoperability Criteria

The following topics will be discussed in detail:

Discovery / search
Browse
Logical data model
User interface
Local extensibility
Technology
Scalability / Bottlenecks
Costs
Compatibility

Discovery and Search

Searches and subsequent discovery are the reasons such a system is desired. We see the plain-vanilla web-browser as the most basic common access method users will employ. We also feel that users will desire a stateful application which will provide quicker, more robust access. There will be those advanced users who wish the direct power of the system to be at their command. Furthermore, there will be different classes of users, each of which has different perspectives on the use of the data and each brings differing levels of knowledge with them as they sit down and perform searches. For example, the K-12 user might inquire about wetlands. A researcher may know explicit names of data-sets, investigations, or data-types and might wish to inquire about them by name. A Senator may be interested in discovering what types of data-products are created by a specific facility. Our plan addresses all three types of access (HTML, application, and direct), and at least all three of these identifiable classes of user.

For web-browser access, our system will offer a forms-based approach that will essentially let users form successively more specific queries. We will provide slightly differing interfaces for the various types of users we identify in order to assist them in forming their queries more quickly and accurately. A Java-based application will be available for users. It will be very similar in form to the HTML approach, but will have more capabilities, though we plan to keep the two very close in ability. For advanced users who wish, we will offer direct access to the database via ODBC or JDBC using either their favorite database access tool (of which there are many very good ones on the market), or they may use one we provide. In this way, the system can immediately be used for very sophisticated searches. Further, developers of existing systems, such as DIAL, may wish to make the few modifications necessary in their code lines to enable their toolsets to browse all ESIP Federation data via BigSur, thus enhancing the value of their own tools. (We will make a JDBC toolkit available to aid them.)

Specificity - Collections and Granules: Our system has robust abilities to form and manage sets of data. Therefore "collections" and "granules" are equally easy to find. We do this by using basic relational database technology for defining relationships - one only need issue the correct query to gather all the components of a collection, or to exclude collections from a result. Our code will issue these queries for our application's users, and from within our Java API.
Retrieval capabilities - Ranking, Relevance, and extent of search compliance: Ranking, relevance and search compliance are concepts valuable to retrieval when one asks for multiple result sets simultaneously and then attempts to merge them, when there may be multiple names for a given searchable attribute, or perhaps when one uses fuzzy logic. We plan to help the user form queries which are likely to return small, appropriate result-sets, thereby ultimately bringing the same final results but without need for these concepts. All results that return rows should be what the user requested. To aid the user in asking for what they want, wherever possible we will provide lists of valid search criteria. For searches which might return a large result-set, we will first check the number of items that satisfy a particular query and if the number is excessive, give users the opportunity to further restrict their queries in hopes of obtaining just what they are after.
Search capabilities - Geospatial ("bounding-box" - including Z), "Fielded search", Free text, Temporal, and Common vs. local attributes: The BigSur Earth Science System® is naturally capable of Geospatial queries. Our preferred database vendor, Informix, has the world's fastest (non-static) indexing method for geospatial attributes - the R-Tree. Both retrievals and inserts of new entries are fast. Our design will accept single point, "bounding box," or polygonal descriptions of geospatial location. For Z, we can use their Geodetic Datablade. (Note that BigSur has no dependencies on Informix; it just has important advantages.)
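A "bounding box" search of the kind described above reduces to an overlap test between the query rectangle and each stored footprint. The sketch below shows that test in plain SQL, using SQLite as a stand-in. The `footprint` table, its columns, and the coordinates are made up for illustration. It does not use an R-Tree index or any Informix feature, it omits Z, and it does not handle boxes that cross the antimeridian.

```python
import sqlite3


def build_footprints():
    """Create a hypothetical table of rectangular footprints (illustrative values)."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE footprint (
        label TEXT, west REAL, south REAL, east REAL, north REAL)""")
    db.executemany("INSERT INTO footprint VALUES (?, ?, ?, ?, ?)", [
        ("scene-A", -122.2, 36.5, -121.7, 37.0),
        ("scene-B", -120.5, 33.8, -119.0, 34.5),
        ("scene-C", -122.6, 37.4, -121.8, 38.2),
    ])
    return db


def search_bbox(db, west, south, east, north):
    """Return labels of footprints that overlap the query box."""
    # Two rectangles overlap when each one starts before the other ends,
    # on both the longitude axis and the latitude axis.
    rows = db.execute("""
        SELECT label FROM footprint
        WHERE west <= ? AND east >= ?
          AND south <= ? AND north >= ?
        ORDER BY label""", (east, west, north, south))
    return [label for (label,) in rows]
```

A spatial index such as an R-Tree answers the same question without scanning every row, which is where the performance advantage mentioned above would come from.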

Similarly, because it's founded upon a commercial ORDBMS, and because these are simple attributes, our system easily handles "fielded search," free text and temporal searches. Our design incorporates keyword and thesaurus tables, and accepts URL "pointers" to other web sites for more in-depth descriptive information. Thus, the user is afforded a direct interconnection between their items of interest and the whole of resources available on the Internet. If better performance with "free text" searches is desired, Informix also offers a Text-Datablade.

BigSur handles common attributes as a natural part of its design. But for those features which are unique to a given ESIP Member's holdings, we offer two mechanisms for storing and searching these data within our system. The basic method we use is to store such data in the supplemental information attribute of each managed object or archetype, as appropriate. The second, more sophisticated approach is for each data provider making "local attributes" available to supply such data in their own tables as extensions to the database design. In this way, exactly the design that suits these unique features may be provided. For sophisticated users, access is not a problem as basic database tools immediately make this information available. For a "canned application" to provide "general purpose" ability to access these data is a bit more problematic. It is possible to take source code intended for use in database management tools and co-opt it for use in such an application; however, the funding we request in our proposal is not sufficient to cover the costs of doing so now. This can be a future add-on. A half-way approach is viable now in which the extensions are described and the sophisticated user directly connects.
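
The two mechanisms for local attributes might look roughly like the following. Everything here is a hypothetical stand-in: the tables objects and ext_ocean_color, their columns, and the sample values are invented for illustration and are not the BigSur schema, and SQLite is used only so the sketch is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    supplemental_info TEXT      -- mechanism 1: free-form local attributes
);
-- mechanism 2: a provider-supplied extension table keyed by object id
CREATE TABLE ext_ocean_color (
    object_id INTEGER REFERENCES objects(id),
    algorithm TEXT,
    cloud_fraction REAL
);
INSERT INTO objects VALUES (1, 'scene-A', 'algorithm=algo-v1; cloud=0.10');
INSERT INTO objects VALUES (2, 'scene-B', 'algorithm=algo-v1; cloud=0.65');
INSERT INTO ext_ocean_color VALUES (1, 'algo-v1', 0.10);
INSERT INTO ext_ocean_color VALUES (2, 'algo-v1', 0.65);
""")


def by_supplemental_text(fragment):
    """Mechanism 1: substring match against the supplemental attribute."""
    rows = conn.execute(
        "SELECT name FROM objects WHERE supplemental_info LIKE ? ORDER BY name",
        (f"%{fragment}%",),
    )
    return [r[0] for r in rows]


def clear_scenes(max_cloud):
    """Mechanism 2: typed comparison through the extension table."""
    rows = conn.execute(
        "SELECT o.name FROM objects o "
        "JOIN ext_ocean_color e ON e.object_id = o.id "
        "WHERE e.cloud_fraction <= ? ORDER BY o.name",
        (max_cloud,),
    )
    return [r[0] for r in rows]
```

The trade-off is visible even at this scale: the supplemental attribute needs no schema change but supports only text matching, while the extension table allows typed comparisons at the cost of the searcher knowing that the table exists.
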

For access to local attributes managed in their original environments, our system provides storage of the meta-data required for ODBC or JDBC access. Also stored are meta-data concerning what tools exist to manipulate such data, and, when available, the details of how to launch such tools.

Browse - Specificity, by collection (e.g. coverage summaries) and by granule Options; Static or On-Demand:

Our design incorporates a browse table that manages "browse" copies of objects, when they are provided. These might be of a reduced granularity or scale, or whatever is deemed appropriate by the owner of the holdings so referenced. Such browse data may be of granules, or collections, as available. They may be provided to the user whenever desired - the selection criteria only need ask for the browse copy, should one exist (i.e. this is a subset of search capabilities).

The question of static or on-demand browse objects is an interesting one. Our system is capable of generating new objects on request; to do this, one would simply provide the correct function and its arguments and submit it (via the BigSur Distributed Processing System®). Provided that functions exist to create browse objects for the data-types held in our collection, BigSur can easily create such objects on demand.

Logical data model

Vocabularies - Valids / Domains, Use applicable standards: As described above, our system provides keyword and thesaurus tables that associate keywords and their definitions with objects. It also contains a scientific domain table which associates keywords and domains, and can provide a reference to what is considered valid in a given scientific domain.

Because we expect our data to come from ESIP Members, we feel that the issue of deciding what keywords are valid is one for the whole Federation, or individual ESIP Members to decide. We are ready today to accept and implement any such decisions. In short, we feel that deciding what keywords are valid is outside our purview.

We feel our job is to aid users who wish to search the collection in finding their desired data. For this, we prefer to use list-boxes of choices for "valid vocabulary" at specific points, and to provide aids in "translation" among vocabularies so that queries search against actual values, whatever they may be, helping ensure that users find the data they are seeking.
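
One way to picture "translation" among vocabularies is a thesaurus lookup that expands the user's term to every equivalent keyword before matching against stored values. The sketch below is an assumption: the synonym groups, the catalog contents, and the function names are hypothetical, and the valid vocabulary itself would be whatever the Federation or its Members decide.

```python
from collections import defaultdict

# Hypothetical synonym groups; real groups would come from the thesaurus tables.
SYNONYM_GROUPS = [
    {"sea surface temperature", "sst"},
    {"precipitation", "rainfall"},
]


def build_thesaurus(groups):
    """Map every term (lower-cased) to the full set of its equivalents."""
    lookup = defaultdict(set)
    for group in groups:
        normalized = {term.lower() for term in group}
        for term in normalized:
            lookup[term] |= normalized
    return lookup


def expand(term, lookup):
    """Return all equivalent terms; an unknown term stands for itself."""
    key = term.lower()
    return lookup.get(key, {key})


def search(catalog, term, lookup):
    """Find objects whose stored keywords match any equivalent of the term."""
    wanted = expand(term, lookup)
    return sorted(
        name
        for name, keywords in catalog.items()
        if wanted & {k.lower() for k in keywords}
    )
```

With this in place a user who picks "Sea Surface Temperature" from a list-box also finds objects that a provider tagged only as "SST".
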

Relationships - Inter-attribute, Parent-child, Thesauri, and Other TBD:

We feel that relationships are one of the very most important aspects in the entire enterprise.

It is especially crucial for those not familiar with data-sets and the environment from which they come to be able to determine how things are related. There are parent-child relationships among data, sets, subsets, and scientific processes, and many relationships among investigations, platforms (such as satellites), and instruments that collect data, all of which support the performance of science and provide the context for scientific interoperability with peers. All of these relationships bear representation within a database which purports to manage the holdings of the Federation. There are further relationships also of interest, such as those in support of the mechanics of data product generation, access, a security scheme, and so forth. The BigSur Earth Science System® manages all these relationships.

Our proposed system contains associative relationships through use of a relational database engine, and elegant database design. With our database schema, it is easy to associate the components of an investigation, the sensors of a satellite, and the products derived therefrom. It's simple, for example, to inquire about what types of data exist that were derived from data collected by a particular instrument. These associative relationships also pertain to people - for example, to discover what items result from the work of a particular person or team. It is through examination of these relationships - i.e. their availability for inquiry - that "interoperability" becomes more than just a buzz-word.

In our system, these relationships are explicitly managed. Parental data sets are referenced by each child object, when this information is available. Collections of objects are associated explicitly, and objects may belong to multiple sets simultaneously. Our schema does not impose a hierarchy, though it's simple to represent one. Further, when available to us, processing lineage is also managed, so that one can easily discover what scientific process created a given object or set of objects. In fact, the execution of a scientific process itself may be managed as an object, so that the results therefrom can conveniently be represented as a whole, whether or not they are proper "collections," or "granules." Thus, objects in our system have three possible "parents" explicitly managed: the archetypal description of the scientific function that created the object, the data that was used as input to that function, and the specific instance of that function's execution.
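
The three explicitly managed "parents" can be modeled in miniature as shown below. This is a sketch only: the class and field names are hypothetical and chosen for readability, not taken from the BigSur schema, and in-memory objects stand in for database rows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProcessArchetype:
    """Archetypal description of a scientific function."""
    name: str


@dataclass
class ProcessExecution:
    """One specific run of an archetype."""
    archetype: ProcessArchetype
    run_id: str


@dataclass
class ManagedObject:
    name: str
    archetype: Optional[ProcessArchetype] = None   # parent 1: the function
    inputs: List["ManagedObject"] = field(default_factory=list)  # parent 2: input data
    execution: Optional[ProcessExecution] = None   # parent 3: the specific run


def lineage(obj):
    """Return names of all upstream input objects, nearest first, no repeats."""
    seen, order, queue = set(), [], list(obj.inputs)
    while queue:
        parent = queue.pop(0)
        if parent.name in seen:
            continue
        seen.add(parent.name)
        order.append(parent.name)
        queue.extend(parent.inputs)
    return order
```

Because every parent is optional, an object whose lineage is unknown is still representable, which matches the statement that this information is recorded only when it is available.
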

Because an "object" may represent a set of objects, our system permits full association of sets and objects in any manner desired. (This is done by creating an object entry for the set, and then associating set members with the set object.) Our system goes further. It associates objects with sites; it is convenient and easy to relate an object to the site where it was created, and the site where it is held, should they be different. This reference can include sufficient meta-data to provide a means of data access for such objects as well, should access to the actual object be desired.

Our system also manages the relationships between types. For example, researcher V. Zlotnicki might have a favorite encoding format - call it the ZlotnickiType - and he might use, as a matter of convenience, the HDF5 storage format, which is represented as a file. These relationships are tracked, and permit discovery of what tools can operate against what objects. In this way a user may ask what tools exist that can be applied to a particular data object, from highly specialized tools that operate against a specific type, for example, the ZlotnickiType, to general-purpose tools such as those that manipulate files. Or perhaps the user wants to know what scientific functions may be applied. This too is an easy query; it's a robust environment when the associations are clear.
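
The tool-discovery query described above amounts to walking a chain of type relationships and collecting the tools registered at each level. The following sketch assumes a simple single-parent chain; the tool names are invented placeholders, and the dictionaries stand in for what would be database tables.

```python
# Hypothetical type relationships: each type points at the more general
# representation beneath it (ZlotnickiType -> HDF5 -> file).
TYPE_PARENT = {
    "ZlotnickiType": "HDF5",
    "HDF5": "file",
    "file": None,
}

# Hypothetical tool registrations; none of these names are real programs.
TOOLS = {
    "ZlotnickiType": ["zlot_viewer"],
    "HDF5": ["hdf5_inspector"],
    "file": ["copy", "compress"],
}


def tools_for(type_name):
    """List applicable tools, most specialized first, then more general ones."""
    found = []
    current = type_name
    while current is not None:
        found.extend(TOOLS.get(current, []))
        current = TYPE_PARENT.get(current)
    return found
```

Asking for tools_for("ZlotnickiType") therefore yields the specialized viewer first and the general file tools last, while tools_for("file") yields only the file tools.
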

In short, we believe relationships are among the very most important features in any interoperability system. Our system has just the right architecture for managing all relevant relationships within an Earth Science framework; one is not compelled to provide all these relationships when they are not available, though there is a place for them when they are.

User interfaces

Implementation & Extensibility - Web Browsers, Java applications, Z39.50, search engines, etc:

As stated above, we plan to provide at least four forms of access: a pure HTML-based "application," a more traditional stateful application written in Java, a Java API, and direct connections to the database (for sophisticated users with their own ODBC or JDBC tools). In our view, these four are sufficient to provide robust access and excellent extensibility. We have no illusion that the funding requested in our modest proposal is sufficient to create a general-purpose application and web page that will meet everyone's hopes and dreams. Yet, we can do justice to the task, and our offering of an API and direct connections allows others to implement their own visions. Given dialogues with fellow ESIP Members, we have reason to believe that others are interested in doing so.

In addition, at the time of this writing, we are working in conjunction with our fellow ESIP partner Dr. Konstantinos Kalpakis (USRA ESIP), to expand our system's capabilities to provide Z39.50 and XML servers.

Local extensibility - Attributes, Vocabularies, Search capabilities, Retrieval capabilities, Data access, Provision of access to local extensions:

The BigSur Earth Science System® has a whole section dedicated to the management of tools available at various sites, including the types of objects against which they operate, where to find them, and even how to start them. Thus, there is robust access to such features. We have discussed above extensions for attributes, and they are sufficient so long as we are not referring to a general-purpose browsing (searching) application that has full knowledge thereof. Since we provide direct database access, there is good extensibility for attributes, search & retrieval capabilities, and also data access through use of local tools. So, we provide robust access to local extensions.

Technology

Portability - Platform dependencies

Our HTML application will have the widest audience we can imagine. Our Java application should enjoy similar platform independence, as will our other tools, since most platforms have ODBC and JDBC tools by this date.

The database can reside on any system which provides an "industrial strength" RDBMS, though our preferred vendor is Informix, for cost and technical capability reasons. (Informix has a very wide range of platforms available.) Should the Federation not wish to use Informix, we can easily port our work, though it will mean some delay in delivery.

Implementation - Language, Special communication requirements (persistent connections, non-standard ports and/or protocols, interactions with firewalls):

As outlined above, our applications will be written in HTML and Java, with SQL for database access. We may offer Z39.50 and XML services.

We have no unique communications requirements beyond normal Internet connectivity. Our HTML application will not use persistent connections, while our Java application will. Our system will not have any trouble with firewalls unless the Federation wishes to put the database engine behind one. We do not envision the use of non-standard protocols.

Scalability / Bottlenecks

Number of providers - We presume here that we are to discuss where the database platform will reside and how it might scale with a workload. Our system has the powerful ability to be split and divided at will because elements in our database such as objects and tools all have location information as a part of their meta-data. The only challenge here is in coordinating applications, directing them to appropriate servers. To aid this, our database design already incorporates the meta-data to direct clients to specific database engines available on any network. We will provide a methodology whereby interested applications contact a central site, and may be directed to alternative sites for more in-depth queries.

By providing the capability to redirect applications to alternative servers, a host of benefits accrue, including: fault-tolerance, a no-downtime backup capability, bottleneck avoidance, and the ability to take advantage of distributed compute and storage resources.

Number of users - Because Informix, our preferred RDBMS vendor, utilizes a multi-threaded, multi-server architecture, the system is very efficient at managing user connections. It is nearly certain that the system would experience a bottleneck elsewhere long before a limit on the number of users would be exceeded. However, if we should discover bottlenecks or otherwise have sufficient load to warrant it, we can divide the system and overcome such problems.

Volume of data - We believe the scale of the Informix RDBMS's storage ability is somewhere beyond ten terabytes. It would take an exceptionally large collection of meta-data to exceed such capacity! We are far more likely to outstrip the Federation's budget for disk drives! However, given our ability to split the environment, we may be able to take advantage of available resources scattered about. We can say that our database schema is well normalized, and has been de-normalized purely for performance and usage considerations. In addition, our proposal provides for a large number of gigabytes of disk storage (from both JPL and Berkeley Earth Science Tools) which should be more than sufficient for the duration of this funding cycle.

Performance

As of the spring of 1999, benchmark reports suggest that Informix, our preferred RDBMS vendor, has a substantial lead over other RDBMS providers in scalability and raw performance. (It also has the most sophisticated and highest performing Object-Relational model.) We expect our system to have robust performance, even when filled with a large quantity of meta-data, and with a heavy user-demand.

Rates
No statistics have been provided or suggested by the Federation for determining rates of access - this is an unknown. However, the systems we will use are of relatively high capacity, and we have several of them available, so we are not concerned about this at this time. We would be delighted to have sufficient interest by the public to experience a problem with access rates!

Latencies
Similarly, no scale information has been provided or suggested by the Federation for use in determining load and therefore what latencies might exist in a deployed system. For a database system, other factors beyond just the scale of information to be searched determine search result latencies; schema design and indices are crucial. We know our database design is well normalized for elimination of redundancy, has been de-normalized for performance, is well organized into logical groupings, and has the world's fastest geo-spatial indexing methods at its disposal. This system will outperform any similarly-sized computer whose data organization is monolithic or which does not have R-Tree spatial indexing. And, by helping users ask better questions, performance will be further improved, reducing latency.

Differential degradation of capabilities
We do not anticipate any differential degradation of capabilities as scale increases because we have taken careful measures to address performance considerations. This is not to say that some user queries will not consistently outperform others, or that, as the database grows in size, some queries will not remain fast while others become slower. This is normal. We are prepared to handle this eventuality, because we have a well designed strategy for overcoming bottlenecks.

Fault tolerance
Our architecture is among the most fault-tolerant possible. Starting with a fully distributed architecture, and an application capable of connecting to alternative sites, the only single point of failure possible is the "master" site and its database. Our plan to address this is to provide a backup system, which is kept up to date all the time and to teach the applications to visit the second site should the first be unreachable. Thus, fail-over is quick and simple. We will also use the second site for a full-sized development and testing environment, and perhaps occasionally to provide additional capacity.
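
The fail-over behavior described above, in which an application visits the second site when the first is unreachable, can be sketched as a simple ordered retry. This is an illustration under assumptions: the site names are hypothetical, and the connect argument is a placeholder for whatever database or network call the real application would make.

```python
def connect_with_failover(sites, connect):
    """Try each site in order and return (site, session) for the first success.

    sites   -- ordered list of site identifiers, master first
    connect -- callable that opens a session or raises ConnectionError
    """
    errors = []
    for site in sites:
        try:
            return site, connect(site)
        except ConnectionError as exc:
            # Remember why this site failed, then move on to the next one.
            errors.append((site, str(exc)))
    raise ConnectionError(f"no site reachable: {errors}")
```

Keeping the site list ordered means the backup is used only when the master fails, so it remains free for development and testing the rest of the time.
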

Costs

Distribution of costs to Providers (minimal vs. optional) and the Federation (Type 3s?)
Most of the costs of this proposal will be borne by the Proposers. "External" costs borne by ESIP Members who create data sets - "Providers" - will be related to the minimum obligations they must fulfill without regard to our system. Those that wish to do more are encouraged to have a liaison spend some time with a member of our team to help implement superior data-gathering tools to automate this aspect of our mutual interest. We are willing to take as much of this burden upon ourselves as limited funding and competing goals allow.

In addition, we have interviewed a number of Type 3 ESIPs. In some cases the burden they pose is so modest we realize it is in our mutual interest to take on the task of managing a Type 3's meta-data ourselves. This may not be true of all Type 3s, but we anticipate that for those that have computer systems where internet access to meta-data can be made available, we can work with them to gather it easily. For those that do not, the quantity of meta-data is sufficiently small that the burden it poses falls into the "noise" category as compared to other, larger sites.

Remaining costs: "plug-in" (purchase, construction, and configuration), administration, and maintenance.
Our proposal's request for funds will cover all necessary costs for implementation of our system. We will not purchase any computer equipment with these funds. Rather, existing resources will be provided for the duration of this funding cycle. These resources include compute hardware, licenses, administration, and maintenance, as outlined above.

It should be noted that our proposal requests, but does not insist, that the Federation provide the hosting of at least the central installation of our system on the computers that the Federation is using to host its web-site. This is sensible from both business and technical perspectives. We realize that the selection of the web-site host neither anticipated nor included funding explicitly for this purpose. Because of the flexibility of our system, we can implement it either way, and are happy to work with the web-site provider to further these Federation goals.

Compatibility - Strategy for accommodating existing systems/clusters/protocols

As stated in myriad ways above, our system is compatible with and is a complement to other systems offered by singular sites or clusters. We will write some code here and there to aid the transfer and interchange of meta-data, with the goal of automating such transfers. In particular, we will write code to fetch GCMD meta-data and deposit it into our database system, and to take data from our system and supply it to GCMD. And we are partnering with a number of other ESIPs to further our mutual goals in these areas.

[end of document]
