NASA Global Change Master Directory: Harnessing the Power of Java Remote Method Invocation in a Heterogeneous Federated Environment

 

K. Nagendra, O. Bukhres, S. Sikkupparbathyam, M. Areal

Computer Science Department

Purdue University School of Science

Indiana University Purdue University

Indianapolis, Indiana. 46202

Z. Ben Miled

Purdue University School of Engineering

Department of Electrical and Computer Engineering

Indiana University Purdue University

Indianapolis, Indiana. 46202

L . Olsen, C. Gokey, D. Kendig, T Northcutt, R. Kordova, G. Major

Global Change Data Center NASA

Goddard Space Flight Center

Greenbelt, MD. 20771

 

 

 

 

Abstract: The Global Change Master Directory (GCMD) is an earth science repository that specifically tracks research data on global climatic change. The GCMD is migrating from a centralized architecture to a globally distributed replicated heterogeneous federated system. One of the greatest challenges facing today’s software engineers is the integration of heterogeneous system without compromising local autonomy, reliability and transparency. We the NASA team were faced with similar challenges while designing and implementing the next version of the GCMD software (Version 8.0). We propose to design and develop an object-oriented system architecture using Java, RMI (Remote Method Invocation) and JDBC that would enable other sources of similar information to be integrated into the GCMD system, without making much change to the local system itself. This paper describes the RMI component and addresses the issues of heterogeneity, distribution and autonomy in the GCMD system.

Keywords:

Distributed, Interoperability, Internet, Distributed Object Management, World Wide Web, Java, Java RMI, Component, JDBC, XML, Interface, Object, Object-oriented.

 

 

 

1. Introduction

The Global Change Master Directory (GCMD) [GCMD70] is a repository that contains information on the changing environment collected by various agencies including the United States government agency Global Change Data Center (GCDC) at NASA. Figure 1 shows other agencies that actively collect similar type of information. The GCMD has been in existence for the past 11 years. It is a database that is growing in importance since a lot of recent research on global climate change is available and the GCMD is one of the few organized efforts to create a system by which uniform storage, access and retrieval is being made possible. The long term goal of this effort was to provide science users with the ability to find and view information about science data regardless of where the data resides. This experiment became what is known as the International Directory Network (IDN).

Fig. 1 - The IDN Network - Courtesy: http://gcmd.gsfc.nasa.gov/ceosidn

Currently the members (sites) of the IDN network have independent processes for collecting, analyzing and displaying data. There is no interaction or information sharing between the members of the IDN network. This result in duplication of effort since some members might collect and analyze the same information without realizing that similar information exists on other sites. Further, due to the independent nature of these sites, the research community loses the benefit of accessing data on other sites. The goal of the GCMD software (Version 8.0) is to interconnect the members of the IDN network and provide a mechanism to share data across network. In this way a user familiar to a site (say GCMD NASA) can access information from other sites (ESA).

It is important to note that GCMD does not contain any of the climatic data and information. It only serves as a repository of metadata information. The metadata contains link to various research and contains information such as the location (e.g. South America, Europe, etc) of research documents and their characteristics such as the document discipline (Oceanography, Seismology, etc). The metadata standard in the GCMD system is the DIF (Directory Interchange Format). DIF is a standard used to store and transfer data within the various sites in the IDN network and consists of a collection of fields that detail specific information about the GCMD data. A DIF entry has approximately 4000 bytes. Further we have implemented a schema translator which translate the sites internal data representation (database or file format) to the DIF format and vice-versa. The translator is custom built for each site.

This paper describes the internetworking aspects of a distributed architecture for the Global Change Master Directory using Java Remote Method Invocation [RMI]. The system was implemented using Java because RMI is tightly coupled with the Java language and Java is portable across various operating systems. The objective of the new system (MD8) is to migrate the GCMD from client-server architecture to a federated architecture that is totally automated while providing sites participating in the IDN network with the autonomy necessary in maintaining their system. Each site has the autonomy in the choosing the operating system, the storage medium (relational database, flat files) and the user interface. Table 1 lists existing systems in the IDN (International Directory Network) framework, the hardware, software, operating systems and databases that are currently used by different agencies.

Agency/Country

Software

Hardware / OS

Database

GCMD, NASA, USA

Isite

IRIX64 oxygen 6.4

IRIX gcmd 5.3, SunO Sgcmd2 Generic Sparc

Oracle

Italy

Isite

IRIX Ganymede 5.3

Oracle

Japan

Not documented

SunOS hcatac 02

Oracle

Germany

Isite

Sun OS P1D1 5.4

Oracle

Geneva

Isite

OSF1

Not documented

Canada

Not documented

Not documented

Not documented

Brazil

Isite

Lynn

Not documented

Argentina

Isite

Not available

Not documented

Australia

Isite

SunOS atlas 5.5

Oracle

NewZealand

Not documented

Not documented

Not documented

Netherlands

Not documented

Not documented

Not documented

Table 1 - Existing IDN Infrastructure

2. Requirements

The complexity involved in designing MD8 includes resolving distribution, heterogeneity, autonomy and duplication issues. The IDN network depends on the internet for a medium and therefore we have to take into account the issues of bandwidth, latency and availability [PCL97] of the internet. Figure 2 provides an overview of the GCMD MD8 architecture used to connect members of the IDN network.

DIF entries are defined to be "owned" by the site that created them. Ownership is exclusive with the exception of the MD8 master site that jointly owns all entries. The dual ownership of entries simply means that while new entries are always broadcast by the respective originating nodes, updates and deletes can be done both by the originating site as well as the master MD8 site. This special trump privilege is provided with the MD8 master site to be able to remotely control some of the sites in the IDN network. Although the propagation of DIF entries and updates will be automatic, the local site staff will manually control the actual validation and acceptance of a propagated entry. This implementation efficiently addresses the distributed aspect of the MD8.

 

Figure 2 - GCMD MD8 Architecture

Transaction Management in a distributed environment has been addressed by various papers from as early as 1979 [GM79]. Most research in the early 1990’s on distributed transaction management resulted in implementations of distributed transaction models systems such as InterBase[JO93]. Interbase has a coordinating server that implements a distributed transaction manager (FLEX) which interacts with different databases and performs a commit or a rollback depending upon the success of individual transactions. With the emergence of the internet as a common information medium, and given the issues of dependability and availability of the internet, the need for asynchronous distributed transactions [LPPB] is escalating.

Heterogeneity a complex issue [FRK98] in MD8, needs to be addressed at four levels: storage format, schema, DBMS and operating system. The GCMD offers relational database support as well as text-file based support. Although DIF, as a schema, is the same in both these formats, the physical data storage in DBMS based storage is entirely different from flat text files based storage. Moreover different IDN sites may follow different schema for storing the research information that flows into their site. The most widely accepted type of schema is the DIF. Interfaces to relational databases are implemented differently in different RDBMS systems. Oracle is the predominant database system in the IDN network. However, other relational databases such as Informix and Sybase are also being used. Similarly UNIX is common but is not the only operating system used in the IDN network. Thus the proposed system needs to interoperate across multiple operating systems.

One very significant aspect of the MD8 is the need to maintain autonomy of individual nodes in the IDN network. Each site has the right to accept or reject an entry that is propagated from any other site in the network. The need for autonomy comes from the fact that each node could possibly be specializing in an area of the globe and could treat some of the data as irrelevant to its audience.

MD8 is also installed on servers that are truly internationally located thus leading to various issues of security and privacy. Given the global nature of the data in the Master Directory, it should be made available to everyone in the world. However security cannot be compromised. Apart from the usual password protection, recent issues like denial of service need to be prevented. The usage of Java’s Security Manager and RMI’s policy file provide an added layer of security.

Given the fact that the directory is a repository of information that is input by change researchers, and given the concept of distribution of all insertions with ownership by originating nodes, it is possible for example that a change researcher inserts a new entry both in node A and node B thus creating the possibility of identical data referring to the same piece of research but that exist with distinct identifiers. This case can be prevented by identifying a set of crucial elements of the data, which, when alike between any two DIF entries, will flag a possible duplication.

In summary, the problem has several dimensions, with the most complex being internetworking and distribution, followed by heterogeneity. The need to provide autonomy has to be addressed, and essentially contributes to the need for an asynchronous solution because of the human guided validation of propagated entries.

3. Implementation

To achieve interoperability between the sites in the IDN network, we introduce three components – the Remote Announcer, the Local Data Agent (LDA) and the RMI component that together form the backbone of our implementation.

RMI Components are a set of published interfaces that the LDA/Announcer uses to interact between the various sites in the network. The basic components of an RMI enabled program are the interface, implementation of the interface (LDA), the Server program, the RMI Registry, RMI Client Stubs and Server Skeletons, and the client (Announcer). The Server program registers an LDA object for each site in the network other than itself. In this way every client has its own LDA to cater to its request. This result in reduced deadlock and improved speed since each client has a corresponding LDA running on a separate thread. Figure 3 shows the interaction between the Announcer and the LDA components of any two sites.

 

Figure 3 - GCMD Site-to-Site Interaction

The basic functionality of the LDA is the server side implementation of the RMI interfaces (services). Each local data agent will provide a uniform external interface and a custom internal representation based on the target database type such as Oracle and Sybase. Each local data agent will also be implemented as an independent component in the broad object-oriented architecture that interacts and cooperates with the other components in the entire system. This will eliminate the problem of requiring the international research community to make double updates, one in their own local format and one in the DIF format. The implementation of local data agents as distributed components will also directly benefit the system by providing the much needed scalability and flexibility.

Table 2 lists the interface’s that describe the functionality of the Local Data Agent (LDA). This interface is implemented in the RMI based server so that the methods can be invoked by the Announcer counterpart.

 

Interface IDN_LDA_Interface

{

synchronized long PutActionItem( int fromMirrorId,

String mirrorSecurityString

String ge_id,

int srcSequence,

int entryType,

String pendingAction,

DIF difObject);

synchronized long AckActionItem( int srcSequence,

int fromMirrorId,

String actionStatus,

String statusText);

synchronized long AcquireLock( int fromMirrorId,

String mirrorSecurityString,

String ge_id);

synchronized long ReleaseLock( int fromMirrorId,

String mirrorSecurityString,

String ge_id);

}

 

Table 2 – IDN_LDA_Interface

When a new entry comes into site A1, the site propagates the entry to sites A2…An-1. Site A1 uses its Announcer module to invoke methods of the LDA in every other site. Thus, to propagate an insert or an update, it invokes the method PutActionItem. Similarly, when a propagated item has been through the validation process, an acknowledgement is sent to the site that originated the request. This is accomplished by invoking the AckActionItem method in the originating site. Locks are required for updates because of the dual ownership of DIF entries between the GCMD site and the originator site. Thus, methods are available for acquiring (AcquireLock) and releasing (ReleaseLock) locks. One of the less significant actions targeted in MD8 is the ability for a site to send a request to the GCMD master site to change a DIF entry so that it is acceptable to the IDN.

The Local Data Agents proposed here resemble the Mediators [YSH96]. The Local Data Agent in this paper is responsible for the incoming propagated traffic of entries, but does not have an active role without the support of the Announcer. Mediators are also employed for schema translations in other heterogeneous systems.

The Announcer is a Java program that is scheduled to run on a daily basis in each of the mirror sites. While this process will run on the source mirror sites, it will call Java RMI based local data agents in the target mirror sites. The methods are invoked via a Stub that serves as a proxy for the remote object. The Announcer obtains a reference to the remote object by invoking the lookup method of the java.rmi.naming object which uses RMI’s URL based naming scheme to specify the remote object. Figure 4 show separate LDA object exist in the server to service every client (Announcer) in the IDN network. MD8 implements asynchronous transactions by storing a schedule of inserts, updates and deletes and broadcasting them through a remote connection to the respective sites.

Figure 4 - Interoperation between Local Data Agent Objects and Announcers

Assuming the existence of a single LDA object in each site, in the worst case, (n-1) sites are competing to execute remote methods on a given site. Java’s RMI mechanism runs methods within the same object on different threads for different clients. However this does not preclude from the wait on synchronized methods. To execute methods without loss of atomicity, Java’s synchronized monitor construct is used to gain mutual exclusion. Propagations from one site really do not interfere with a simultaneous propagation from another site, even when the target site is the same. Thus, we want the flexibility to execute methods in a synchronized methods within objects for a fixed pair of source and target (Ai, Ak) sites, but at the same time allow the possibility of simultaneous execution of methods for the pair (Ap, Ak).

The GCMD staff validates DIF entries originating at a given site such as San Diego or Budapest, before they are inserted into the local site’s database. Every entry that gets added into the local site’s database subsequently gets queued for propagation to all remote IDN sites. The Announcer program will broadcast the new entry to all the other sites, and awaits their acknowledgement. Such broadcast entries can either be accepted or rejected by the staff at every IDN site. In both cases, an acknowledgement is sent to the originating IDN site with a reason for rejection, in the case of rejection. Updates to a DIF entry can be done at the site that originated that entry or the master GCMD site at Goddard. However, the GCMD entry cannot propagate an addition of a DIF entry to other sites. Similarly, deleting an entry can be done both by the originating site of the entry as well as the GCMD Goddard site.

Asynchronous [LPPB] replication is really the best way to accomplish such an implementation, given the global nature of the system and the frequent occurrences of network failures. Asynchronous transaction model fits in well with our implementation since the data in the IDN sites need not be in sync immediately. Hence the temporarily out-of-sync condition is well tolerated. Further more, our design stipulates that only one site (owner of DIF entry or GCMD trump) can initiate and complete an insert, update or delete of a unique DIF entry. This satisfies the unidirectional update propagation property of asynchronous transaction. However, a locking mechanism needs to be in place to cover for the possibility of not-in-sync updates that happen simultaneously at the GCMD site as well as the originating site. Further asynchronous transactions improve response time and prevent global deadlock. The two broad approaches to handle the problem of replica updates [YRRSA] in a distributed database system are eager and lazy. The eager approaches update all the replicas of an item as a part of a single transaction. Under lazy propagation, only one replica is updated by the transaction itself and separate transaction runs on behalf of the original transaction at each site at which update propagation is required. Our implementation uses lazy propagation whose benefits are mentioned in paper [YHB97].

Relevant information about the server and the other sites with their I/P locations on each IDN site are stored in serialized local storage. This storage is implemented as a Java serialized file. The number of IDN sites to which a given site and server have access can increase over time. The actual list of sites and the I/P addresses can be input by the GCMD server administrator at the time of system installation or can be added later.

 

4. Migration Benefits

The MD8 discussed in this paper will serve the worldwide environmental research community better than the present GCMD system, because it overcomes the technical difficulties of the present system which included the scalability, automation, distribution and maintainability problems. Most important it is based on a federated architecture which automates the process of sharing change research entries.

The MD8 is a scalable system. For example, if a new country wants to enter the IDN circle, then all that is needed is to install the MD8 components at the new country agency’s site. During such installation, the administrator of the new site will simply need to enter template data that will help the server talk to the local database if one already exists. Then the server will notify all other IDN sites of its existence.

The MD8 is a highly flexible system by design. The flexibility is three fold. First, the translator component of the GCMD handles schema differences that may exist across different IDN sites’ data. The translation of local schemas to the common DIF format adds more flexibility to the system. Second, the Heterogeneity component handles differences across commercial database systems such as Oracle and Sybase. Thus the server software can operate on top of any existing database system. Third, MD8 is implemented using Java which allows portability across different operating systems.

The MD8 design is a fully federated system. There are servers executing at each IDN site. About 70 to 75 % of the metadata is fully replicated across the network. Temporary breakdown of a given server will not affect the IDN as data is updated on every site on a regular basis. The frequency of these updates is site specific. The 70 to 75% data replication is a substantial improvement compared to the current 5% to 10% of global data that the GCMD stores in its database.

Very little maintenance is required for MD8. Data validation is a manual process and does need to be done as an evaluation step to include or exclude an entry in the database. However all sites are automatically updated with new data. The ability of the design to automatically propagate and process database changes on different sites goes a long way in low cost maintenance of the system.

The system was implemented using state of art technology, including Java, XML and

distributed databases. Since our development is in a pure Java environment and transaction processing is being handled by our application, we chose RMI over CORBA and other middlewares. Also, Java RMI is a Java specific middleware that is tightly coupled with the Java language. Hence there are no separate IDL (Interface Definition Languages) mappings that are required to invoke remote object methods. RMI extends the Java exception classes to deal with remote failures. The new classes extend error handling across Java virtual machines (JVM). RMI also supports Distributed Garbage Collection that ties into the local Garbage collectors in each JVM.

The MD8 implementation also ensures a base standard of SQL which can be executed on several different vendors’ relational database implementations.

5. Related Work

The Interbase project [JO93] integrates pre-existing systems over a distributed, autonomous and heterogeneous environment via a tool-based interface. It supports heterogeneous applications without violating the local autonomy of component systems.

Interbase has two major components: Distributed Flex Transaction Manager (DFTM) and a set of remote system interfaces (RSI’s). The DFTM interprets and coordinates the reliable execution of global transactions over the entire system. It also provides a unified and flexible interface, the IPL language, with which system programmers can specify the data and control flow of a global transaction. This paper extends the Interbase work. MD8 differs from the Interbase implementation in that it uses an object-oriented approach.

DCOM[DCOM] ,CORBA [CORBA] and Java RMI are all protocols for connecting distributed systems. DCOM will work only across Windows NT or Windows 95/98 platforms. Significant research on implementing CORBA based object-oriented multi-database systems (MOOD, MIND) is discussed in [DDO96, DDO98,EGO95]. This work employs the CORBA standard for integrating heterogeneous systems across networks. MD8 employs Java RMI as means of connecting IDN sites across networks and exchanging information. Java allows ease of deployment irrespective of the presence of an ORB in a given site.

The papers [SDE98] and [PADE98] provide insights into addressing the problem of database heterogeneity in a distributed internet-based environment using the XML script language. The latter mentions that the creation of XML wrappers will be necessary to make XML the universal mechanism for addressing heterogeneity. While XML provides a presentation mechanism for query results of heterogeneous sources, the reverse process, which is writing back to heterogeneous sources, is not well established in either research paper. The Garlic project at IBM [MPI97] describes wrappers and how they can provide transparency to underlying data. We have used wrappers to encapsulate the DIF information.

The challenges facing today’s distributed database systems, in terms of interoperability,

proactivity, autonomy and interactiveness is discussed in paper [PCL99] . The use of Java technologies gives our implementation the flexibility to run on various platforms while the usage of the DIF schema to encapsulate data provides autonomy to the underlying database system.

Paper [LPPB] introduces the need for asynchronous transaction and compares it with synchronous transaction. It goes further in defining the basic requirements for asynchronous transactions based on which our implementation is designed. In our implementation the interaction between LDA and the Announcer is asynchronous. Even though our system uses the asynchronous transaction model, we have made provisions to handle situations of repeated transaction failure.

Performance issues involving middleware technologies are addressed in paper [FRK98].

Author evaluates the performance of 3 gateways - Miracle, DJ, EDI/S mainly for database operations. Our implementation runs every client’s request on its own thread space in the server. This enhances the overall performance of the system both in terms of speed and scalability.

7. Conclusion & Future Work

The need for a new version of the GCMD software is motivated by the advances in technology (object-oriented architecture, Java programming language, middleware technologies) and a concerted effort on the part of NASA to expand the scope of GCMD. The MD8 is a federated system that employs some of the latest technological tools. Our solution highlights the use of Java, RMI and other Java technologies to implement a truly scalable, distributive and cost-effective system.

In the development of the MD8 system, the GCMD team considered several development tools very carefully before selecting the Java platform. The disadvantage of Java is in the slow execution speed of the byte code, but for the GCMD application speed is not a major concern. The GCMD application is based on the lazy replication model and more importance is attached to the completeness of a transaction rather than speed. We need a system that is totally heterogeneous and that can reliably do distribution with minimal blocking on method invocations.

One important aspect of the system is the use of serializable files for distribution and propagation. Some real speed issues in traversing through serialized I/O in a Java RMI based server system have been overcome by substituting serialized file I/O with database counterparts in sites where relational databases are available. However, it is important to note that there are some sites in the IDN network that do not use any conventional database systems.

The current implementation utilizes Java’s serialization capability to transport objects across the network. In future, we want to explore the use of XML as a way to encapsulate information and utilize it as a platform for handling heterogeneity. Further we would like to integrate systems developed in a non-java programming environment such as C++.

 

References

 

[CORBA] CORBA and OMG Information Resources, http://www.acl.lanl.gov/CORBA

[DCOM] DCOM Solutions in Action, http://www.microsoft.com/com

[DDO96]A multidatabase implementation on CORBA (MIND), A.Dogac, C.Dengi, G.Ozhan et al. SRDC, METU, Turkey, 1996.

[DDO98] Building Interoperable Databases on Distributed Object Management Platforms, CM, A.Dogac, C.Dengi and T.Ozsu, Feb 1998.

[EGO95] Experiences in Using CORBA for a Multidatabase Implementation, Ebru Kilic, Gokhan Ozhan, C.Dengi et al, SRDC, METU, Turkey, 1995

[FRK98] The Heterogeneity Problem and Middleware Technology: Experiences with and performance of Database Gateways. Fernando de Ferreira Rezende, Klaudia Hergula, Dept.CAE-Research (FT3/EK), Diamler Benz AG – Ulm-Germany, Proc of 24 VLDB conference New York, USA, 1998.

[GCMD70] GCMD System Documentation Software version 7.0 http://gcmd.gsfc.nasa.gov/

[GM79] Proving Consistency of Database Transactions. Georges Gardarin and

Michel A. Melkanoff, VLDB 79.

[JO93] USENIX, Journal of USENIX Association, The Implementation of Cooperative Mechanisms among System Components in a Heterogeneous Multidatabase Environment, by Jiansen Chen, Omran Bukhres et al. Computing Systems, vol.6 no.3, Summer 1993.

[LPPB] The Need for Distributed Asynchronous Transactions. – Lymon Do, Prabhu Ram, and Pamela Drew. Boeing Phantom Works.

[MPI97] Don’t Scrap It, Wrap It - A wrapper architecture for Legacy Data source. -Mary Tork Roth, Peter Schwarz. IBM Almaden Research Center. Proc of 23 VLDB conference Athens, Greece, 1997.

[PADE98] Virtual Database Technology: XML, and the Evolution of the Web, STS Prasad, Anand Rajaraman, DE Bulletin, June 1998.

[PCL97] The Network as a Global Database: Challenges of Interoperability, Proactivity,

interactiveness, Legacy. Peter C. Lockemann et al. Proc of 23 VLDB Conference Athens,Greece, 1997.

[RMI] Java Remote Method Invocation, http://java.sun.com/products/jdk/rmi/

[SDE98] Resistance is Futile: The Web will assimilate your database, Susan Malaika, DE Bulletin, June 1998.

[YHB97] Replication and consistency: Being lazy helps sometime -Yuri Breibart and Henry F. Korth, Bell Laboratories, PODS 97 Tucson Arizona USA.

[YRRSA] Update Propagation Protocols for Replicated Databases. –Yuri Brietbart, Raghavan Komondoor, Rajeev Rastogi, S Seshadri, Avi Silberschatz.

[YSH96] Object Fusion in Mediator Systems - Yannis Papakonstantinou and Serge Abiteboul and Hector Garcia-Molina, VLDB 96.