caArray API Guide

Overview

This guide provides an introduction to the API offered by caArray. It describes the two distinct APIs caArray has, as well as the two platform bindings for each API. It also contains a summary of the methods available in each API, identifies some common patterns in the method design, and offers some tips on how to achieve the best performance and write the most concise client code for both.

Both APIs are currently read-only APIs. One can group the functionality of each API into these broad categories:

  • Raw data retrieval
  • Parsed data retrieval
  • Search

Legacy API and Service API

caArray offers two distinct APIs, which operate on two distinct, though related, data models.

The Legacy API operates on the internal domain model of caArray. This means that when changes are made to the internal model, clients must upgrade in lockstep to support the new data model. A given instance of caArray will only support a single version of the Legacy API. To ensure that objects can be safely serialized, a graph-cutting algorithm is applied to instances of the internal model before they are returned by the Legacy API; this is described in detail in the Legacy API Reference.

The Legacy API provides four methods.

  • The first is the CQL Query method that implements the data service specification of caGrid. This method accepts a CQLQuery and returns a CQLResultSet containing all results matching the query from the data store.
  • The remaining three methods provide for retrieving file contents, obtaining a parsed data set, and retrieving array design details. These three methods are likely to be deprecated in a future caArray release.

The Service API operates on a separate "external" data model distinct from the internal model. This external model can evolve independently from the internal model. This provides for backwards compatibility: when a new version of the Service API is released, older versions will continue to be operational, and clients need not modify their code in lockstep. We expect that instances of caArray will support at least the three most recent Service API versions, giving clients plenty of time to work on upgrading. In addition, the external model was specifically designed with serialization to a wire-transferable form in mind; cyclic relationships between classes are avoided, and all objects are serialized without the need for object-graph cutting to be applied.

The Service API contains methods to retrieve file contents, retrieve a live MAGE-TAB data set for an experiment, retrieve a parsed data set, and retrieve an annotation data set, as well as several methods for searching various entities, such as experiments, biomaterials, files, and so forth.

API Versioning Scheme

Because the Legacy API is tied to the internal model, it is versioned in the same manner as the overall caArray application. This guide refers to version 2.3 of the Legacy API. Note that prior to version 2.3, this was the only API offered by caArray.

The Service API, since it is decoupled from the internal model, is versioned independently. This guide refers to version 1.0 of the Service API.

Choosing an API

When choosing which API to use, the decision comes down to whether you need to use the CQL Query method. The Service API does not support CQL queries; therefore, if you require them, the Legacy API is the only choice. However, the Service API does contain several powerful ways to perform the most common searches: criteria searches, keyword searches, and search by example. These have been optimized for their specific use cases, and therefore are simpler to use and faster than equivalent CQL queries. Thus you truly require CQL queries only if you are writing a generic CQL-based client or framework, or have an esoteric query that cannot be expressed through any of the Service API search methods.

If you do not need to use CQL queries, then we strongly encourage you to use the Service API. The other methods in the Legacy API are provided primarily to ease migration of existing clients, and are likely to be deprecated in the next release and removed entirely in a subsequent release.

caGrid and Remote EJB API bindings

Both APIs can be accessed through two technology bindings: as caGrid services, and as remote EJBs. Each API offers an equivalent set of functions in both bindings. For some functions the method signatures are identical in both bindings, but this is not the case for all functions, because the bindings rely on some technology-specific mechanisms. Specifically:

  • For file content transfer, the Remote EJB binding makes use of the RMIIO open source library, which permits streaming data over an RMI connection. The caGrid binding takes advantage of the caGrid Transfer mechanism for the same purpose. Both are described below.
  • Both Grid and Remote EJB bindings allow specifying limit / offset parameters for most search methods. However, the caGrid binding offers additional variants of those search methods that make use of the WS-Enum specification to control result iteration.

Remote EJB Binding Details

This section describes some aspects specific to the Remote EJB binding.

EJB3 / RMI implementation

The Remote EJB Binding relies on the JBoss implementation of EJB3 and RMI remoting. Appropriate JBoss libraries must be present on the classpath. These libraries are included in the caarray-client.zip distribution.

RMIIO streaming

For file content retrieval in the Service API, the RMIIO open source library is used to provide the ability to stream file contents over the RMI transport. This is necessary because files are often quite large, and returning the contents of the file as a byte array (as is done in the Legacy API file retrieval method) can cause out of memory errors on both the client and the server side.

The RMIIO library, while using the RMI transport, opens its own port. This port must be accessible to the client, in the same manner as the primary RMI port for EJB3, which may require a firewall exception. The port used by the RMIIO library is configurable at installation time. Refer to the installation guide for details.

To properly use the streaming API, a few rules must be followed:

  • The RemoteInputStream must always be closed in a finally block, so that it is released even if reading fails.
  • If the operations to be performed on the file are computationally expensive, then the file contents should be fully read and saved in a temporary location first, and the remote input stream closed. The file contents can then be processed from this temporary location. This conserves resources on the server.
  • If the file size is small enough, then its contents can be read into memory. For maximum efficiency, an array large enough to hold the entire file should be allocated (using metadata for the file which includes its size).
  • If the file size is large, then storing it into a byte array allocated from the heap may cause out of memory issues. In this case, it should be saved into a file instead.
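The rules above can be sketched in plain Java as follows. This is a minimal illustration rather than caArray client code: a standard java.io.InputStream stands in for the stream obtained from the RemoteInputStream, and the class and method names (StreamingRulesSketch, readFully, spoolToTempFile) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch only: a plain InputStream stands in for the remote stream. */
class StreamingRulesSketch {

    /** Small file: allocate one array sized from the file metadata and fill it. */
    static byte[] readFully(InputStream remote, int sizeFromMetadata) throws IOException {
        try {
            byte[] contents = new byte[sizeFromMetadata];
            int filled = 0;
            while (filled < contents.length) {
                int n = remote.read(contents, filled, contents.length - filled);
                if (n < 0) {
                    throw new IOException("Stream ended after " + filled + " bytes");
                }
                filled += n;
            }
            return contents;
        } finally {
            remote.close(); // always close, even when reading fails
        }
    }

    /** Large file, or expensive processing: copy to a temporary file first. */
    static Path spoolToTempFile(InputStream remote) throws IOException {
        Path tmp = Files.createTempFile("caarray-download-", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(remote, tmp, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            remote.close(); // the stream is closed before any processing starts
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "probe,intensity\nA1,42\n".getBytes(StandardCharsets.UTF_8);

        byte[] inMemory = readFully(new ByteArrayInputStream(sample), sample.length);
        System.out.println("Read " + inMemory.length + " bytes into memory");

        Path spooled = spoolToTempFile(new ByteArrayInputStream(sample));
        System.out.println("Spooled " + Files.size(spooled) + " bytes to " + spooled);
        Files.delete(spooled);
    }
}
```

In both helpers the stream is closed before the caller starts working with the data, so any expensive processing happens against the local copy rather than against an open remote stream.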

Authentication

The Remote EJB binding allows clients to retrieve non-public data. To do so, the client must authenticate using a particular user's credentials. Any queries executed after doing so will include all data to which that user has access. The authentication is made against the same identity store as the web application's authentication.

Below is an example of how to perform an authenticated search.

This performs a search for all experiments to which the user "caarrayuser" has access.

caGrid Binding Details

This section describes some aspects specific to the caGrid binding.

Usage of the caarray-client.jar

caGrid is a language-agnostic specification. caGrid services are ultimately SOAP services whose method signatures, inputs and outputs are described via WSDL and XSD documents. It is possible to write a client of the caGrid binding of the caArray API that deals with the raw XML. However, if the client is to be written in Java, the preferred approach is to make use of the provided caarray-client.jar. This encapsulates the SOAP API behind a Java facade, and takes care of marshalling and unmarshalling XML into a Java object model, which simplifies working with the API considerably. The Java object model is the same one as used by the Remote EJB binding.

For further convenience, the Service API provides a set of interfaces that hide some of the API complexity and abstract the technology binding details. These interfaces are described in the Service API reference.

caGrid Transfer

Both the Legacy API and the Service API use the caGrid Transfer mechanism for providing file contents retrieval. caGrid Transfer works around issues of large binary data retrieval in caGrid by using an out-of-band HTTP connection to actually transfer the data.

Keep in mind that caGrid Transfer creates a resource that expires. The default expiration time is 30 minutes. Therefore, if a transfer is expected to take longer than that, the timeout value for the resource should be increased. Conversely, after the transfer is complete, the resource should be destroyed so it can be reclaimed immediately.
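As a rough illustration of the expiration rule, the sketch below estimates a transfer's duration from the file size and an assumed throughput, and compares it with the 30-minute default. It does not call caGrid Transfer; the class and method names are hypothetical, and the throughput figure is something the client would have to measure or guess.

```java
import java.time.Duration;

/** Sketch only: decides whether the default resource lifetime is likely too short. */
class TransferTimeoutSketch {

    /** Default lifetime of a caGrid Transfer resource, as stated in this guide. */
    static final Duration DEFAULT_EXPIRATION = Duration.ofMinutes(30);

    /** Estimated transfer time: size divided by throughput, rounded up to whole seconds. */
    static Duration estimateTransferTime(long fileSizeBytes, long bytesPerSecond) {
        if (fileSizeBytes < 0 || bytesPerSecond <= 0) {
            throw new IllegalArgumentException("size must be >= 0 and throughput > 0");
        }
        long seconds = (fileSizeBytes + bytesPerSecond - 1) / bytesPerSecond;
        return Duration.ofSeconds(seconds);
    }

    /** True when the estimate exceeds the default expiration time. */
    static boolean needsLongerExpiration(long fileSizeBytes, long bytesPerSecond) {
        return estimateTransferTime(fileSizeBytes, bytesPerSecond)
                .compareTo(DEFAULT_EXPIRATION) > 0;
    }

    public static void main(String[] args) {
        long twoGiB = 2L * 1024 * 1024 * 1024;
        long oneMiBPerSecond = 1024L * 1024;
        System.out.println("Estimated time: " + estimateTransferTime(twoGiB, oneMiBPerSecond));
        System.out.println("Extend expiration: " + needsLongerExpiration(twoGiB, oneMiBPerSecond));
    }
}
```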

The caGrid Transfer homepage on the caGrid wiki provides detailed documentation on the general design of caGrid transfer, and on usage of specific features such as resource expiration.

WS-Enum

WS-Enum is a specification for creating a stateful server-side resource representing the results of an operation, and retrieving those results in chunks. The caArray Service API contains methods returning references to WS-Enum resources as alternatives to search methods which take LimitOffset parameters. Generally speaking, the methods taking LimitOffset parameters offer more fine-grained control over the returned results and should be preferred; the WS-Enum variants are offered for use in generic tools that support that standard.
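To illustrate the limit / offset style of result iteration, here is a minimal paging loop. It is a sketch under assumptions, not the caArray API: the search method is represented by a generic function taking a limit and an offset, fakeSearch is an in-memory stand-in for a server, and the loop assumes that a page shorter than the requested limit means the results are exhausted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Sketch only: client-side iteration over a limit / offset search. */
class LimitOffsetPagingSketch {

    /** Collects all results; `search` takes (limit, offset) and returns one page. */
    static <T> List<T> fetchAll(BiFunction<Integer, Integer, List<T>> search, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<T> all = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<T> page = search.apply(pageSize, offset);
            all.addAll(page);
            if (page.size() < pageSize) {
                return all; // short or empty page: assumed to be the last one
            }
            offset += page.size();
        }
    }

    /** Hypothetical in-memory stand-in for a remote search method. */
    static List<String> fakeSearch(List<String> store, int limit, int offset) {
        int from = Math.min(offset, store.size());
        int to = Math.min(from + limit, store.size());
        return store.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> store = Arrays.asList("E1", "E2", "E3", "E4", "E5");
        List<String> all = LimitOffsetPagingSketch.<String>fetchAll(
                (limit, offset) -> fakeSearch(store, limit, offset), 2);
        System.out.println(all);
    }
}
```

Because the client chooses both the page size and the starting offset on every call, it can resume, skip ahead, or stop early, which is the kind of fine-grained control that the limit / offset variants provide.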

The caArray implementation of WS-Enum resources only supports the maxItems parameter of IterationConstraints. The maxCharacters and maxDuration parameters are ignored.

The WS-Enum homepage on the caGrid wiki has detailed documentation on the general design of the WS-Enum implementation within caGrid.

Authentication

The caGrid binding of either caArray API currently does not support the various caGrid authentication / security mechanisms. Only anonymous access is supported, and thus only public data may be retrieved.

Choosing a Technology Binding

In general the caGrid platform is the preferred semantically interoperable SOA platform for caBIG applications. It is therefore the preferable technology binding, and will receive the most attention going forward. However, certain current constraints of both the platform and its implementation in caArray lead to situations where the Remote EJB binding is the better choice for certain use cases. This section discusses the considerations that drive this decision. There are four primary considerations: 1) the language in which the client is written, 2) the network environment, 3) the need for secure data access, and 4) the need for caGrid infrastructure.

Implementation Programming Language

If the client is to be written in a language other than Java, then the caGrid binding is essentially the only choice, as the Remote EJB binding uses the Java-specific RMI protocol. On the other hand, if the client is written in Java, then the Remote EJB binding is generally faster, as it uses a binary serialized form that is more compact and quicker to marshal / unmarshal than the XML serialization used by the caGrid binding.

Network Environment

The caGrid binding operates over the HTTP protocol, whereas the Remote EJB binding operates over the RMI protocol. HTTP is much more commonly supported by corporate network configurations, whereas RMI may require a custom firewall exception (on either the client or the server side) that may not always be obtainable. If such an exception cannot be obtained, then the caGrid binding may be the only choice.

Secure Data Access

The caGrid binding currently only offers anonymous access, returning only publicly available data. On the other hand, the Remote EJB binding supports authenticated access, and thus allows access to users' private data if the client logs in with that user's credentials. Thus, if authenticated access is required, the Remote EJB binding is, at present, the only choice.

caGrid infrastructure

If the client needs to make use of other aspects of the caGrid infrastructure, for example the discovery capability for locating caArray instances, then naturally the caGrid binding is the only choice.

In summary, the Remote EJB binding should be used if you require authenticated access, or if all of the following hold: the client is written in Java, it will generally communicate with a single specific caArray instance, network restrictions are not a factor, and performance is critical. In other cases, the caGrid binding should be preferred.
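The summary rule can be written down as a small decision function. This is only a restatement of the guidance in this section; the class, enum, and parameter names are hypothetical and are not part of any caArray library.

```java
/** Sketch only: encodes the binding recommendation from this section. */
class BindingChoiceSketch {

    enum Binding { CAGRID, REMOTE_EJB }

    static Binding choose(boolean needsAuthenticatedAccess,
                          boolean clientIsJava,
                          boolean singleKnownInstance,
                          boolean rmiReachable,
                          boolean performanceCritical) {
        // Authenticated access is currently available only over Remote EJB.
        if (needsAuthenticatedAccess) {
            return Binding.REMOTE_EJB;
        }
        // Otherwise Remote EJB is recommended only when every condition holds.
        if (clientIsJava && singleKnownInstance && rmiReachable && performanceCritical) {
            return Binding.REMOTE_EJB;
        }
        return Binding.CAGRID;
    }

    public static void main(String[] args) {
        // A Java client behind a firewall that blocks RMI, reading public data.
        System.out.println(choose(false, true, true, false, true));
    }
}
```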

API Reference

The previous sections describe the broad principles behind the APIs. For information on the particular methods each API makes available, with usage notes and code examples as appropriate, refer to the caArray Legacy API v2.3 Reference and the caArray Service API v1.0 Reference, respectively.
