caArray 2.5.0 Technical Guide
Topics in this document include:
- Architectural Representation
- Architectural Goals and Constraints
- Use-Case View
- Logical View
- Architecturally Significant Design Elements
- User Interface Layer
- API Layer
- Domain Model
- Application Logic Layer
- Array Platforms Layer
- Binary Data Storage Layer
- DAO Layer
- Database Layer
- Cross-cutting concerns
- Implementation View
- OSGi-based Plugin System
- caarray-install.zip and caarray-upgrade.zip
- Deployment View
This document provides a comprehensive architectural overview of the caArray system, using a number of different architectural views to depict different aspects of the system. It is intended to capture and convey the significant architectural decisions which have been made on the system.
This document describes the aspects of caArray's design that are considered architecturally significant; that is, those elements and behaviors that are most fundamental for guiding the construction and continuing development of caArray, both as a standalone system and in the context of the caBIG ecosystem. Stakeholders who require a technical understanding of caArray are encouraged to start by reading this document, then reviewing the caArray UML model, and finally reviewing the source code. Note that all diagrams in this document are taken from the caArray UML model; for more detail about the elements in these diagrams, consult the source model.
Acronyms and Definitions
- DAO - Data Access Object
- EJB - Enterprise JavaBeans
- JEE - Java Enterprise Edition
- JSE - Java Standard Edition
- JDK - Java Development Kit
- JPA - Java Persistence API
- JSP - JavaServer Pages
- MAGE-OM - Microarray Gene Expression Object Model
- MAGE-TAB - Microarray Gene Expression Tabular Format
- POJO - Plain Old Java Object
- RUP - Rational Unified Process
- UML - Unified Modeling Language
- OSGi - Dynamic Module System for Java
- CSM - Common Security Module (an NCI CBIIT security framework)
Philippe Kruchten, "The 4+1 View Model of Architecture," IEEE Software 12(6), November 1995.
UML Model References
The diagrams in this document come from an Enterprise Architect UML model. This model is found in the docs/analysis_and_design/models directory of the caArray subversion repository. The model is composed of four root-level model packages that are each stored as a controlled package in the SVN repository. The four models are:
- Domain Model - domain_model.xml - contains conceptual domain classes used during the inception phase of caArray 2.
- Use-Case Model - use_case_model.xml - contains the principal use cases for caArray.
- Design Model - design_model.xml - contains the design elements making up the logical view for caArray.
- Deployment Model - deployment_model.xml - contains the artifacts and nodes making up the implementation and deployment views for caArray.
To access these model packages, you must create a local EA project, set up version control, and import these packages; the details on doing so are in the Working with EA Models Guide.
Throughout this document, each diagram will have below it a reference to the top-level package and the path to the subpackage within that top level package where the diagram can be found in the model.
The Rational Unified Process, Version 2003.06.13
The caArray architecture is represented in the caArray Technical Guide and in the UML design models as a set of views of the system from different but complementary perspectives. These views are:
- The Use-Case View - Describes the functional requirements of the system.
- The Logical View - Describes the organization of the system design into subsystems, interfaces, and classes, and how these elements work together to provide the functionality described in the use-case view.
- The Process View - Illustrates the process decomposition of the system, including the mapping of classes and subsystems onto processes and threads. This view is not addressed in this document, as the standard JEE threading model is used.
- The Implementation View - Describes the software components that realize the elements from the logical view and the dependencies between these components.
- The Deployment View - Describes how the processes are allocated to hardware and execution environments and the communication paths between hardware nodes.
This style of describing software architecture is the approach recommended by the Rational Unified Process and is based on Philippe Kruchten's "4+1 View Model of Architecture," as refined in the Rational Unified Process (RUP).
Architectural Goals and Constraints
The following factors are key considerations beyond the functional requirements that have influenced the architecture of caArray 2.0.
caBIG Silver Compliance (historical)
At the time caArray 2.0 was originally designed and architected, caBIG Silver compliance was a project requirement. While Silver compliance has been superseded by newer caBIG-wide guidelines, it is retained here as a historical note.
Remote API Usability
One of the major flaws in releases of caArray prior to 2.0 has been the requirement to use MAGE-OM to access annotation and data. Navigation between key classes in MAGE-OM is inefficient, difficult to understand, and difficult to implement. The object API exposed by the new evolution of caArray is designed to be easily understandable and navigable by remote clients, whether they access the API via the grid or through a Java programmatic interface.
Storage and Retrieval
Given that data storage and retrieval is the principal functionality of caArray, the performance of array data parsing, storage, and retrieval is key to a successful design.
The image below shows the major caArray use cases, organized by functional area. Each functional area is described in more detail below. These use cases drive the architectural design of security, validation, file management, data storage and retrieval, and API.
[UML Source: Use-Case Model -> Use-Case Model]
These use cases span registration of new users, login of existing users, and management of users and groups. User provisioning is done primarily through UPT (a separate application), but groups are managed within caArray.
Both LDAP and a local database are supported as repositories of user identity data and as authentication mechanisms. Integration with caGrid security mechanisms is envisioned for the future, but is not currently supported by the architecture.
These use cases deal with tasks done by the system administrator in support of other activities, such as monitoring audit logs, and changing ownership of experiments and collaboration groups.
These use cases deal with management and curation of supporting data elements that are required for experiment management and data validation import, such as protocols, controlled vocabularies (ontologies), and array designs.
It should be noted that currently, management of controlled vocabularies is entirely inside caArray and is limited to only a few ontological categories. Ontology references and ontology terms may also be created via MAGE-TAB import, but will not be editable unless they fall into these categories.
Array designs are expected to be uploaded and created as needed per caArray instance - no array designs will be pre-loaded.
Expanded curation functionality that would allow for review, correction, and merging of ontological entries is envisioned for a future release, but has not yet been designed or implemented in the system. This functionality may include integration with external ontological repositories.
Search and Navigation
These use cases support discovery of experiments and biomaterials of interest. The primary discovery mechanisms are browsing by categories, and search based on keywords.
Currently only discovery of experiments and biomaterials on the local caArray instance is supported. Cross-instance discovery via caGrid is envisioned for the future, but not currently explicitly supported by the architecture.
These use cases concern creation and management of experiments, together with their associated annotations and data. This encompasses upload of data files, as well as validation and import of data files in order to link the microarray data they encode to the experiment and its biomaterials.
Data and Annotation Retrieval
These use cases deal with enabling the experiment annotation and data to be retrieved from the system. Download of the original data is supported, as well as the export of current experiment annotations in MAGE-TAB and GEO format.
These use cases detail the various operations that may be performed programmatically by partner applications and services, via either caGrid or Remote EJB API mechanisms. The operations are focused on allowing experiments and biomaterials of interest to be located, and associated annotations and data to be retrieved, either in raw or parsed format. As data sizes are quite large, significant emphasis must be given to architectural mechanisms that permit large amounts of data to be retrieved over the API.
The design model (from which the logical view is taken) is the most significant model, requiring the most effort and containing the majority of the content. Accordingly, the description of the logical view of caArray's architecture receives the most attention here. We first describe the structural hierarchy of the system in layers, packages, and subsystems and then describe how these elements collaborate to provide the most architecturally significant functionality. Figure 3.1 illustrates the top-level structural organization of caArray, using layers as the primary organizing concept. The major dependencies between layers, packages and subsystems are represented as well, though it should be noted that some supporting dependencies have been elided to enhance readability of the diagram.
The only subsystems that are accessible to external systems are those in the API layer - the caGrid web services in the Grid API package and the Remote EJB beans in the Remote EJB API package. All subsystems implemented in the Application Logic and lower layers are internal to the application and do not expose remote interfaces. The User Interface layer is accessible to web clients via HTTP(S). Clients of caArray can be characterized as either web UI clients or API clients.
caArray is implemented as a JEE application leveraging core JEE and JSE technologies.
In addition, caArray incorporates a number of open source libraries, including caBIG core toolkits. Some of the key ones are:
- Hibernate - object-relational mapping
- Guice - dependency injection
- Struts 2 - UI-layer MVC framework
- Apache Felix - OSGi container
- Atlassian Plugins - OSGi-based plugin framework
- CSM - authentication, authorization, and instance-level security
- NCI-Commons - 5AM toolkit for NCI applications
Only EJB session and message-driven beans are employed. Persistence is managed directly with Hibernate 3.2 rather than through the JPA standard: because caArray is a standalone application, the need to switch persistence providers is not anticipated, and native Hibernate provides additional features not available in the JPA standard.
The layers and their constituent subsystems are shown in the diagram below and are described in detail in the following section.
[UML Source: Design Model -> Architecturally Significant Design Elements]
Architecturally Significant Design Elements
User Interface Layer
The caArray user interface is accessed as a standard web application via HTTP(S). It is implemented as a JEE web application employing Struts 2 as the Model-View-Controller implementation. This layer provides presentation, navigation and UI-level validation functionality only. All application logic, including data model-level validation, is implemented in the lower layers of caArray.
Validation logic at the UI level is limited to form-based validation specific to a particular view (for example, checking for appropriate field formats or query parameter combinations) and is implemented using Struts 2 validation. Additionally, a bridge from Struts 2 validation to the Hibernate Validator framework ensures that data model-level validation (invariants such as non-nullness of fields, size, min/max elements, and so on) is exposed to the user in the same way as UI-level validation. As noted before, however, data model validation is enforced in the lower layers, and so is not dependent on going through the UI layer.
The User Interface layer also includes the login authentication classes CaArrayDBLoginModule and CaArrayLDAPLoginModule. These are LoginModule implementations used to integrate CSM authentication into the JAAS standard security model, allowing for both database- and LDAP-backed authentication.
caArray has two APIs: the Service API and the Legacy API.
The Service API uses an external data model, independent of the application domain model. This data model is defined in the caarray_service_model_v1_0.eap Enterprise Architect UML model under the docs/analysis_and_design/models/ directory in the caArray source tree. The Service API includes methods for retrieving file contents, exporting a MAGE-TAB data set, obtaining a parsed data set, retrieving an annotation set, and a variety of search methods. A translation layer, powered by Dozer, is used to transform objects between the external data model and the domain model.
The Service API is versioned independently of the application itself. Multiple versions of the API can be deployed in the same container, providing backwards compatibility for integration partners. Some other design goals and advantages of the Service API are:
- Decoupling the external model from the internal persistence model allows us to evolve the external model and API independently from the internal model, and support multiple API versions concurrently.
- The external model is also simpler compared to the internal model, focusing on the information important to client code. The internal model has aspects that are optimized for performance with a backing database; the external model dispenses with that, instead optimizing the model for clarity and serialization.
- The external model is designed explicitly for serialization, avoiding deep object graphs and loops. This means graph cutting is no longer required for this model, and all returned objects are fully populated.
- The new API contains methods tailored to specific use cases, whose implementations have been optimized. This makes common tasks easier, reducing the need for the client code to have a deep understanding of the full model and also reducing the potential for non-performant queries.
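The decoupling between the internal persistence model and the external model can be illustrated with a minimal, hand-rolled translation. caArray itself performs this mapping with Dozer; all class and field names below are hypothetical, and the cycle-breaking via ID back-references follows the design principle described above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Internal (persistence) model: bidirectional, Hibernate-friendly.
class InternalSample {
    long id; String name; InternalExperiment experiment; // back-reference forms a cycle
    InternalSample(long id, String name) { this.id = id; this.name = name; }
}
class InternalExperiment {
    long id; String title; List<InternalSample> samples;
}

// External model: no back-references; the parent is referenced by id only.
class ExternalSample { long id; String name; long experimentId; }
class ExternalExperiment { long id; String title; List<ExternalSample> samples; }

public class ModelTranslationSketch {
    // One-way translation, analogous to what a Dozer mapping configuration does.
    static ExternalExperiment translate(InternalExperiment in) {
        ExternalExperiment out = new ExternalExperiment();
        out.id = in.id;
        out.title = in.title;
        out.samples = in.samples.stream().map(s -> {
            ExternalSample es = new ExternalSample();
            es.id = s.id;
            es.name = s.name;
            es.experimentId = in.id; // cycle replaced by an id back-reference
            return es;
        }).collect(Collectors.toList());
        return out;
    }

    public static void main(String[] args) {
        InternalExperiment exp = new InternalExperiment();
        exp.id = 42; exp.title = "Demo";
        InternalSample s = new InternalSample(1, "sample-1");
        s.experiment = exp;              // bidirectional link in the internal model
        exp.samples = List.of(s);
        ExternalExperiment ext = translate(exp);
        System.out.println(ext.samples.get(0).experimentId); // 42
    }
}
```

Because the external graph contains no cycles, it can be serialized whole without graph cutting.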
The Legacy API (so named because it was the original caArray API prior to the introduction of the newer Service API) uses a subset of the internal persistence model, which is defined in the caarray_internal_model.eap UML model under the docs/analysis_and_design/models/ directory in the caArray source tree. This API offers a CQL Query implementation, which the Service API does not. The Legacy API also has a few other methods, but they are expected to be deprecated and removed in the future.
Both the new Service API and the Legacy API are available as caGrid services and Remote EJBs. Both API realizations are the same from a functional and informational standpoint, but the details of the methods differ slightly.
We have created an API Guide which goes into the practical details and specifics of the Service and Legacy API, and their realizations as caGrid services and remote EJBs. Below, we discuss the APIs from an architectural perspective.
Legacy Remote EJB API
The Legacy Remote Java API is implemented as a facade (the CaArrayServer class) representing a connection to caArray, together with a set of stateless session EJBs with remote visibility. Clients instantiate a CaArrayServer instance, call the connect method, and can then access the session EJB interfaces through accessor methods exposed by the CaArrayServer. These EJBs provide simplified, efficient access to caArray entities and data. Special consideration was given to the DataRetrievalService API to enable clients to retrieve only the data they require. Clients may select data for specific QuantitationTypes and Hybridizations by configuring a DataRetrievalRequest object and passing it as an argument to the getDataSet() method. The remote interfaces and their exposed operations are shown in the class diagram below.
Each remote Java API method performs object graph cutting to minimize the data transmitted. The DataRetrievalService's cutting is more sophisticated: instead of performing cutting at the child object level, all information about the DataSet is returned in a single request.
[UML Source: Design Model -> Design Elements -> API -> Remote EJB API -> Legacy -> Legacy Remote Java API]
Legacy Grid API
The caArray Legacy Grid API is a caGrid 1.5-compliant data service along with several analytical service methods. The service was created via the Introduce Toolkit, and then modified to improve performance and add additional features. The service implementation is a thin layer that communicates, via JNDI and RMI, with a running instance of the Legacy Remote EJB API, and uses the latter to execute incoming service requests.
The grid service provides both the standard data query (CQLQuery) method and several analytic services. All data in caArray is available via the data service. Optimized data access for some data types is available via the analytic services; however, users of those services are encouraged to migrate to the Service API.
To perform CQL searches, the service uses the search method exposed by the CaArraySearchService EJB. After passing the CQLQuery to the EJB API, additional transformations are applied to generate a CQLQueryResults object for the grid client. The EJB search API performs the bulk of the work for grid clients. The search method accepts the CQLQuery object and returns matching objects from the domain model, ignoring any query modifiers in the original CQLQuery. caArray uses the CQL2HQL class provided by the sdkQuery32 package from caGrid to translate the CQL to HQL, which is immediately runnable in Hibernate.
Before any object or list of objects is returned, the server performs object graph cutting on the returned objects. This cutting prevents large, fully connected object graphs from being returned to clients and potentially overwhelming network, memory, or other resources. The graph cutting first initializes the root objects and all directly associated objects. Then, for each directly associated object, the associations from those objects to their directly associated objects are all set to null. As a result, remote clients, including the grid service itself, receive a limited set of data, and enough information about the dependent objects to continue to fill out the object graph to an arbitrary depth.
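The two-step cutting described above can be sketched with toy entities. The Entity class and cut method here are illustrative only, not the actual caArray implementation; the real code operates on Hibernate-managed domain objects.

```java
import java.util.ArrayList;
import java.util.List;

// Toy entity: each node knows its directly associated nodes.
class Entity {
    String name;
    List<Entity> associations = new ArrayList<>();
    Entity(String name) { this.name = name; }
}

public class GraphCutSketch {
    // Keep the root and its direct associations; sever everything deeper,
    // mirroring the graph cutting described for the legacy search API.
    static void cut(Entity root) {
        for (Entity direct : root.associations) {
            direct.associations = null; // second-level links are nulled out
        }
    }

    public static void main(String[] args) {
        Entity root = new Entity("experiment");
        Entity child = new Entity("sample");
        Entity grandchild = new Entity("extract");
        child.associations.add(grandchild);
        root.associations.add(child);

        cut(root);
        // The client still receives the sample, but must issue a follow-up
        // query to fill in the extract and deeper levels of the graph.
        System.out.println(root.associations.get(0).name);         // sample
        System.out.println(root.associations.get(0).associations); // null
    }
}
```

Repeated queries against the severed leaves let a client rebuild the graph to any depth it needs, as the text above describes.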
The grid service receives the list of matching domain objects from the search API and transforms those results into the CQLQueryResults expected by the grid client. To assist in this translation, caArray utilizes the CQLResultsCreationUtil from the SDK. Depending on query modifiers, the system either (1) translates whole objects, (2) translates unique values of specific properties, or (3) returns the count of objects in the list.
One of the analytical service methods, createFileTransfer, deserves special mention. It provides efficient retrieval of the contents of a file stored in caArray, and does so by taking advantage of the Grid Transfer framework introduced in caGrid 1.2. The Grid Transfer framework provides an out of band channel for retrieving the binary data. Instead of returning the data directly and serializing it inside the SOAP response, the data is staged on the server, a WS-RF resource is created for the data, and a reference to this resource is returned to the client. The client then uses this reference to initiate a transfer of the actual data over a separate HTTP connection.
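The out-of-band pattern can be sketched with an in-memory analogue. The class and method names below are hypothetical; the real implementation stages the data as a WS-RF resource and serves it over a separate HTTP connection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Minimal in-memory analogue of the Grid Transfer pattern: the service call
// returns only a small reference, and the bytes travel over a separate channel.
public class TransferSketch {
    private final Map<String, byte[]> staged = new HashMap<>();

    // Analogue of createFileTransfer: stage the data, return a reference.
    String createTransfer(byte[] fileContents) {
        String ref = UUID.randomUUID().toString();
        staged.put(ref, fileContents);
        return ref; // only this token is serialized in the "SOAP" response
    }

    // Analogue of the out-of-band retrieval against the staged resource;
    // the resource is consumed once it has been fetched.
    byte[] retrieve(String ref) {
        return staged.remove(ref);
    }

    public static void main(String[] args) {
        TransferSketch server = new TransferSketch();
        byte[] data = "CEL file bytes".getBytes();
        String ref = server.createTransfer(data);
        byte[] fetched = server.retrieve(ref);
        System.out.println(new String(fetched)); // CEL file bytes
    }
}
```

Keeping the bulk data out of the SOAP envelope is what makes this approach efficient for large files.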
The diagram below illustrates the relationships between the classes implementing the Legacy Grid API. The bulk of the classes shown (with the exception of CaArraySvcImpl and CaArrayCQLQueryProcessor) are generated by Introduce and provide the standard marshalling and query functionality of a caGrid data service. CaArraySvcImpl is responsible for the actual implementation of the service methods (with the exception of CQL Query), and does so by delegating to the appropriate Remote EJB API services. The CQL Query method has special status in the framework and is handled by CaArrayCQLQueryProcessor, which again delegates to the Remote EJB API for the implementation, with the additional processing described above.
[UML Source: Design Model -> Design Elements -> API -> Grid API -> Legacy -> Legacy Grid API]
Service Remote EJB API
The Service Remote EJB API is implemented in a similar manner to the Legacy Remote EJB API, with a facade CaArrayServer class that provides access to stateless remote EJB beans that implement the service methods. Clients instantiate a CaArrayServer instance, call the connect method, and can then access the session EJB interfaces through accessor methods exposed by the CaArrayServer.
In contrast to the Legacy Remote EJB API, no graph cutting is performed by the Service Remote EJB API, as the external data model is explicitly optimized for over-the-wire transmission. Thus, all objects returned by the Service API are fully populated.
The Service Remote EJB API relies on the Application Logic layer to actually perform the operations requested, and then makes use of the translation mappings, using the Dozer library, to transform the objects from the internal model to the external data model before returning them to the client.
Several of the methods provide search capabilities that can potentially return a large number of results, which can pose performance and memory problems due to serialization overhead and database load. This is handled by the inclusion of a LimitOffset parameter in those service methods, allowing the client to control which subset of the result set is returned. The entire result set can then be retrieved by a succession of calls with appropriate LimitOffset parameters. In addition, to ensure that performance and memory problems do not manifest, the service implementation itself enforces a hard limit on the number of results it will return, even if the client requests more.
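A sketch of how a client might page through a full result set under these rules. The method names and the hard-limit value are hypothetical; only the limit/offset windowing and the server-side cap follow the design described above.

```java
import java.util.ArrayList;
import java.util.List;

public class LimitOffsetSketch {
    static final int SERVER_HARD_LIMIT = 100; // server-side cap; value illustrative

    // Server side: honor the requested window, but never exceed the hard limit.
    static List<Integer> search(List<Integer> allResults, int offset, int limit) {
        int effectiveLimit = Math.min(limit, SERVER_HARD_LIMIT);
        int from = Math.min(offset, allResults.size());
        int to = Math.min(from + effectiveLimit, allResults.size());
        return new ArrayList<>(allResults.subList(from, to));
    }

    // Client side: retrieve the entire result set with successive windows.
    static List<Integer> fetchAll(List<Integer> allResults, int pageSize) {
        List<Integer> collected = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<Integer> page = search(allResults, offset, pageSize);
            if (page.isEmpty()) break;
            collected.addAll(page);
            offset += page.size();
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 250; i++) data.add(i);
        System.out.println(fetchAll(data, 100).size());    // 250
        System.out.println(search(data, 0, 10000).size()); // 100 (capped)
    }
}
```

Advancing the offset by the size of each returned page keeps the loop correct even when the server silently caps an oversized request.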
Service Grid API
The caArray Service Grid API is a caGrid 1.2 analytical service with several methods. It does not include a data service. The service was created via the Introduce Toolkit, and then modified to improve performance and add additional features. Like the Legacy Grid service, the service implementation is a thin layer that communicates, via JNDI and RMI, with a running instance of the caArray Remote EJB Service API, and uses the latter to execute incoming service requests.
Some of the service methods allow access to the byte contents of files stored in caArray. This service takes advantage of the Grid Transfer framework introduced in caGrid 1.2 to do so. The Grid Transfer framework provides an out of band channel for retrieving the binary data. Instead of returning the data directly and serializing it inside the SOAP response, the data is staged on the server, a WS-RF resource is created for the data, and a reference to this resource is returned to the client. The client then uses this reference to initiate a transfer of the actual data over a separate HTTP connection.
For the search methods that can return a large number of results, two parallel sets of methods are provided. The first set uses the LimitOffset parameter to control result set size, as described in the Service Remote EJB API section. The second set makes use of the WS-Enumeration specification to provide the same capability. It returns WS-Enum resources, which provide a standardized way of retrieving a limited number of results. The caArray implementation of WS-Enum resources only supports the maxItems parameter of IterationConstraints; the maxCharacters and maxDuration parameters are ignored.
The classes implementing the Service Grid API follow the same pattern as the Legacy Grid API. The bulk of the classes are generated by Introduce and provide the standard marshalling and query functionality of a caGrid service. The CaArraySvcImpl_v1_0 class is responsible for the actual implementation of the service methods and does so by delegating to the appropriate Remote EJB API services.
External Data Model
The External Data Model provides a simplified set of classes (along with companion XML Schemas) that represents a subset of the data stored in caArray and allows this data to be sent over service interfaces efficiently. The design of this data model has several characteristics:
- Bidirectionality is avoided. This allows complete object graphs to be serialized without resorting to graph-cutting hacks. When back-references are necessary, they are provided in the form of IDs.
- Only the attributes of interest to external clients are included.
The External Data Model also includes a number of Criteria classes to support specific search use-cases implemented by methods in the external API.
The diagrams below show the set of packages and classes making up the external data model, as well as the details of example key classes to illustrate the design principles described above.
[UML Source: Design Model -> Design Elements -> API -> External Model -> External Model Packages]
[UML Source: Design Model -> Design Elements -> API -> External Model -> External Model Key Classes]
This section describes the classes used to model the microarray experiments and data that caArray is designed to manage. These classes are employed by all of the caArray subsystems. A subset of these classes must also be understood by clients of the Legacy Remote EJB API. Grid clients do not need the Java domain classes, since the domain model is registered in caDSR. However, grid clients implemented in Java are encouraged to use the domain classes, as they can then make use of the Castor framework for seamless deserialization of the XML into the domain classes. Classes that represent important data constructs are described in detail here.
The underlying object model is implemented as a set of POJOs that model the domain of microarray experiments and data. Whereas caArray 1.x used MAGE-OM 1.1 as the basis for the underlying object and data model, caArray 2.x is based on a completely revised, simplified object model. Although MAGE-OM is a published standard, there are significant disadvantages to using it as an underlying object model: it is complicated to understand, inefficient to store, its structure does not permit useful object graph navigations, and many common relationships cannot be stored when complete experiment annotation is not available. For these reasons, we have chosen to produce a new, simplified object model for domain data representation.
The domain classes are principally designed to support the entity model described by the MAGE-TAB 1.0 specification. The underlying object model described by MAGE-TAB is considerably more understandable than MAGE-OM while still providing a model complete enough to support MIAME compliance. Annotations for array design elements are represented by a hierarchy of annotation classes based on the array design type. Each array design element that reports on a biological sequence is related to an instance of AbstractProbeAnnotation.
As has been noted earlier, array data needs to be represented in a way that allows for efficient storage and transport when required by remote clients. caArray is designed to represent array data at two levels:
- The AbstractArrayData hierarchy represents individual data files that have been imported into caArray, describing their type and relationships to hybridizations. These are high level representations that do not contain the actual data values.
- The DataSet class and the classes related to it by composition (HybridizationData and the AbstractDataColumn hierarchy). These classes ultimately contain the array data values, stored as arrays of primitive or string values within the AbstractDataColumn subclasses.
The DataSet classes are used both to persist the data contained in array data files and as a container for custom data sets requested by clients. As an example, a given Affymetrix CEL file imported into the system will have a single persistent DataSet containing a single persistent HybridizationData instance that contains several AbstractDataColumn instances (IntegerColumns for CELX and CELY, a FloatColumn for CELIntensity, and so on). If a Legacy Remote EJB API client requests the data for all CEL files within an experiment, a transient, compound DataSet is created that contains multiple HybridizationData instances, each retrieved from persistent storage.
A columnar approach to data representation allows for efficient retrieval and storage when compared with a row-based representation. This columnar approach is preferable for two reasons:
- Array data files typically contain relatively few columns but a large number of rows, often in the tens of thousands or more. When returning data to remote clients, it is far more efficient to serialize a large array of primitives than to return a large object graph.
- Clients typically require only a small subset of the columns represented by an array data file, so organizing data by column allows for much more efficient custom DataSet assembly. Clients may indicate which columns to select by specifying QuantitationTypes to retrieve.
In addition to efficient storage and transfer, this approach is also intended to meet the needs of caB2B and other tools that require everything to be navigable in the model (that is, tools that rely on the domain model semantics and are not aware of the data retrieval API). Making the columns themselves persistent with their data allows these clients to navigate to the raw data values, while caArray still retains an efficient mechanism for storage and retrieval (the columns' compressed, serialized value arrays are transparently expanded on request).
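The column-selection idea can be sketched as follows. This is a toy analogue of HybridizationData and the AbstractDataColumn hierarchy; the class names, types, and quantitation-type strings are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Columnar array-data sketch: each quantitation type maps to one primitive
// array, so a client asking for one column serializes a single float[]
// rather than tens of thousands of per-row objects.
public class ColumnarDataSketch {
    private final Map<String, float[]> columns = new LinkedHashMap<>();

    void addColumn(String quantitationType, float[] values) {
        columns.put(quantitationType, values);
    }

    // Custom DataSet assembly: return only the requested columns,
    // analogous to selecting QuantitationTypes in a retrieval request.
    Map<String, float[]> select(String... quantitationTypes) {
        Map<String, float[]> subset = new LinkedHashMap<>();
        for (String qt : quantitationTypes) {
            subset.put(qt, columns.get(qt));
        }
        return subset;
    }

    public static void main(String[] args) {
        ColumnarDataSketch data = new ColumnarDataSketch();
        data.addColumn("CELIntensity", new float[] {1.5f, 2.5f, 3.5f});
        data.addColumn("CELIntensityStdev", new float[] {0.1f, 0.2f, 0.3f});
        Map<String, float[]> subset = data.select("CELIntensity");
        System.out.println(subset.size()); // 1
    }
}
```

A row-oriented representation would instead force the server to materialize and serialize every column of every row, even when the client wants only one quantitation type.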
The diagrams below show the full set of packages and classes making up the caArray domain model, as well as selected key classes implementing the Experiment model, array design annotations, and array data storage.
[UML Source: Design Model -> Design Elements -> Domain Model -> Domain Model Packages]
[UML Source: Design Model -> Design Elements -> Domain Model -> Experiment Significant Classes]
[UML Source: Design Model -> Design Elements -> Domain Model -> Array Design Significant Classes]
[UML Source: Design Model -> Design Elements -> Domain Model -> Array Data Significant Classes]
Application Logic Layer
This layer consists of a set of subsystems implementing the primary business logic for the application. Each subsystem consists of a primary stateless EJB bean serving as the facade to that subsystem, potentially with additional helper classes used to implement its logic. These EJB beans are used by both the User Interface layer and the API layer to perform all their operations. In particular, any manipulation of persistent data is done through beans in the Application Logic layer. This ensures that validation, transactionality, and other cross-cutting concerns are applied consistently between the UI and the API.
Below we describe some of the key subsystems in this layer. The subsystems are named after the interface of their primary stateless EJB facade bean.
The ProjectManagementService subsystem is implemented as a facade to allow the user interface to create and retrieve experiments and their associated annotations. The implementation of this subsystem delegates to the DAO layer for entity management and to the FileAccessService for file management. The subsystem contents are shown below.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.project -> ProjectManagementService Implementation]
The FileAccessService subsystem is responsible for the storage of all files managed within caArray (annotation, array design, and data). Files that are uploaded to caArray are registered with the FileAccessService, which reads each file, compresses the contents, and stores them as one or more BLOBs in the database associated with a CaArrayFile instance. Due to limitations in MySQL when storing very large blobs (>250MB), caArray breaks very large files into multiple blobs for storage in the database. The storage of multiple blobs is transparent to users of the CaArrayFile class.
Because of MySQL limitations, retrieving BLOB data from the database is expensive in both memory and time. Therefore, file retrieval is performed through the TemporaryFileCache interface. This provides clients with methods to get the data as a java.io.File, which can then be used to stream the data. Clients are expected to tell TemporaryFileCache when they are done using files so that it can remove the files from temporary file system storage, but the subsystem also performs cleanup at the end of HTTP or Remote API requests, and when it is finalized. The static structure of the FileAccessService subsystem and the act of storing file contents are shown in diagrams below.
The default TemporaryFileCache implementation is thread-bound. This means potentially having duplicates of temporarily uncompressed files, but such duplication should be the exception, as files are only needed on download and when parsed. After weighing the potential approaches, the minor overhead of temporary duplicates was judged preferable to the overhead of maintaining file reference counters across multiple sessions.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.fileaccess -> FileAccessService Implementation]
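The thread-bound cache idea can be sketched with a ThreadLocal map of temporary files. The real TemporaryFileCache works with CaArrayFile instances and handles decompression; the names and structure below are assumptions for illustration only.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a thread-bound temporary file cache. Each thread gets its own
// map of logical names to temp files; closeFiles() deletes everything the calling
// thread materialized.
class ThreadTempCache {
    private static final ThreadLocal<Map<String, File>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    static File getFile(String name) throws IOException {
        Map<String, File> files = CACHE.get();
        File f = files.get(name);
        if (f == null) {
            f = File.createTempFile(name, ".tmp");  // would hold the uncompressed data
            files.put(name, f);
        }
        return f;
    }

    static void closeFiles() {
        for (File f : CACHE.get().values()) f.delete();
        CACHE.get().clear();
    }
}
```

Because the map is thread-bound, two threads asking for the same file each materialize their own copy, which is exactly the duplication tradeoff discussed above.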
Whereas the FileAccessService subsystem handles the lower-level functionality of file storage and retrieval, the FileManagementService subsystem is responsible for higher-level logical file operations, specifically the validation and import of MAGE-TAB annotation files, array design files, and array data files. The subsystem implements these operations by delegating to subsystems responsible for handling the various types of data. The organization of the FileManagementService subsystem is shown below; the central bean delegates import and validation functionality to a set of helper classes, which in turn make use of other subsystems such as ArrayDataService.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.file -> FileManagementService Implementation]
Asynchronous Processing of Files
Array design files and array data files are huge (typically on the order of hundreds of MB to a few GB), and the processing time required to validate and import them is long (on the order of several hours in many cases). To avoid adversely impacting system responsiveness, requests to validate and import files are placed on a queue for asynchronous processing. We give a short description of how the queue currently works.
[UML Source: Design Model -> Design Elements -> Application Logic - Asynchronous Processing of Files]
In the FileManagementServiceBean, when one of its methods to validate and import array design files or array data files (such as importFiles, validateFiles, importArrayDesignDetails, or reimportAndParseProjectFiles) is invoked, it invokes JobQueueSubmitter.submitJob, which in turn invokes JobQueueDaoImpl.enqueue. JobQueueDaoImpl is (amongst other things) a wrapper around an in-memory queue, which for our purposes here we will call the "job queue." JobQueueDaoImpl.enqueue adds the job to the job queue and at the same time calls JobMessageSenderImpl.sendEnqueueMessage to publish an "enqueue" notification message on the JMS topic "topic/caArray/FileManagement". This notification message is received by FileManagementMDB.onMessage, which picks up the job request from the in-memory job queue and initiates the actual processing of the validation or import request. The job is dequeued after processing, whether or not processing was successful.
As you can see, the current asynchronous processing mechanism involves two queues: the in-memory job queue and the JMS topic. The former contains references to the actual job requests; the latter is used to notify the asynchronous thread that actually processes each job request.
Originally, the asynchronous processing mechanism involved only one queue, the JMS topic, into which job requests were enqueued directly. Later, new requirements called for the ability to inspect and manipulate the queue: in particular, to cancel a job that was already enqueued, and to view the statuses of all job requests currently in the queue. Because a JMS queue behaves like a black box, it was necessary to introduce a queue that was transparent to, and more readily manipulated by, our application code. Thus the in-memory queue was introduced, and the JMS topic was retained as a way to notify an asynchronous thread to process an enqueued job request.
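The two-queue arrangement can be roughly sketched as follows, with a plain BlockingQueue standing in for the JMS topic. All names here are illustrative stand-ins for JobQueueDaoImpl, JobMessageSenderImpl, and FileManagementMDB; this is a sketch of the design, not the real code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A transparent in-memory job queue that can be inspected and cancelled, plus a
// separate notification channel (standing in for the JMS topic) that wakes the worker.
class JobQueue {
    private final Deque<String> jobs = new ArrayDeque<>();
    private final BlockingQueue<Boolean> notifications = new LinkedBlockingQueue<>();

    synchronized void enqueue(String job) {
        jobs.addLast(job);
        notifications.offer(Boolean.TRUE);  // "enqueue" message on the topic
    }

    synchronized boolean cancel(String job) {  // possible because the queue is transparent
        return jobs.remove(job);
    }

    String takeNext() throws InterruptedException {
        notifications.take();  // worker blocks here, like FileManagementMDB.onMessage
        synchronized (this) { return jobs.pollFirst(); }
    }

    synchronized int size() { return jobs.size(); }
}
```

Note that a cancelled job leaves a stale notification behind; the worker simply finds no job for it, which mirrors why the notification channel carries no job data itself.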
Some consideration is being given to improving the current design. In particular, the dependence on an in-memory queue means that job requests on the queue are lost whenever the system goes down (whether scheduled or unscheduled). Furthermore, the current implementation relies on a single asynchronous thread to do the actual processing of job requests, and as mentioned earlier, processing a single job request can often take several hours. Some steps in the processing of a validation or import job are CPU intensive, while others are I/O bound (database intensive). For the I/O-bound steps, having extra threads process additional jobs in parallel can improve throughput even on machines with a single CPU core. For the CPU-intensive steps, additional threads and/or processes on multiple cores, multiple CPUs, or multiple machines can improve throughput. The dependence on the in-memory job queue, however, precludes leveraging additional external processes.
The ArrayDataService subsystem is responsible for validating and importing array data files, storing array data and retrieving array data when requested by clients. The typical order of events related to a given array data file is as follows:
- An array data file is validated using the validate(arrayDataFile : CaArrayDataFile, ...) : FileValidationResult operation. Only the generated FileValidationResult is created and persisted.
- The data file is imported using the importData(arrayData : AbstractArrayData, ...) : void operation. At this point, a DataSet and associated HybridizationData and AbstractDataColumn instances are created and populated based on the source array data files.
The actual parsing of data files is delegated to implementations of the DataFileHandler interface, which know how to parse, validate, and extract the data columns for a particular data file format. These implementations are in the Array Platforms layer. The diagram below illustrates the pattern of delegation.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.arraydata -> ArrayDataService Implementation]
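The delegation pattern can be sketched as follows. The real DataFileHandler SPI has a richer contract (validation, column extraction, platform host interfaces), so the interface and names below are simplified assumptions.

```java
import java.util.List;
import java.util.Optional;

// ArrayDataService does not parse files itself: it asks each registered handler
// whether it recognizes the file, and delegates parsing to the first match.
interface DataFileHandler {
    boolean supports(String fileName);
    String parse(String fileName);
}

class CelHandler implements DataFileHandler {  // e.g. a handler for Affymetrix CEL files
    public boolean supports(String fileName) { return fileName.endsWith(".CEL"); }
    public String parse(String fileName) { return "parsed CEL data"; }
}

class ArrayDataRegistry {
    private final List<DataFileHandler> handlers;
    ArrayDataRegistry(List<DataFileHandler> handlers) { this.handlers = handlers; }

    Optional<String> importFile(String fileName) {
        return handlers.stream()
                .filter(h -> h.supports(fileName))
                .findFirst()
                .map(h -> h.parse(fileName));
    }
}
```

An unrecognized file simply yields an empty result, leaving the caller to report an unsupported format.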
The ArrayDesignService is responsible for validating, importing, persisting, and retrieving array design annotation from the various array annotation file types. The array annotation is stored in the ArrayDesignDetails and caBIO reporter annotation structures described earlier in the section on the caArray Domain Classes package.
The actual parsing of array design files is delegated to implementations of the DesignFileHandler interface, which know how to parse, validate, and extract the reporters and features for a particular design file format. These implementations are in the Array Platforms layer.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.arraydesign -> ArrayDesignService Implementation]
The MageTabParser subsystem is responsible for reading a set of files in MAGE-TAB format, validating the files and ultimately representing the contents of the files in the object model, based on MAGE-TAB concepts. This object model is a low-level model that mirrors MAGE-TAB structures very closely. It is decoupled from the persistence model of the application to allow the MAGE-TAB parsing functionality to be reusable outside the caArray context.
The major implementation classes are shown below.
[UML Source: Design Model -> Design Elements -> Application Logic -> MAGE-TAB -> MAGE-TAB Parser Implementation]
The MageTabTranslation subsystem of caArray is invoked to translate from the MAGE-TAB object model generated by the MageTabParser subsystem to the corresponding caArray domain model representation. It implements a set of translator classes for each MAGE-TAB document type and for shared data types (i.e. *Term*s and *TermSource*s). It contains logic that can merge the incoming data from the MAGE-TAB files with already-existing data for an experiment (as well as Terms and TermSources).
The major classes and dependencies are shown below.
[UML Source: Design Model -> Design Elements -> Application Logic -> gov.nih.nci.caarray.application.translation -> magetab -> MageTabTranslationService Implementation]
The VocabularyService subsystem is responsible for managing controlled vocabularies. The implementation of this subsystem delegates directly to the DAO layer for CRUD operations on the Domain Model entities. The data model accessed by this service represents a subset of concepts present in external vocabularies such as the MGED Ontology. These concepts provide a consistent Term and Category view of vocabularies, and allow the service to manage both locally and externally controlled vocabularies. However, at this time, external vocabularies are not communicated with directly. Instead, both local and external vocabularies are stored within the caArray database for maximum efficiency. The user interface and the MAGE-TAB translation subsystem both access this service to store and retrieve terms as needed.
Array Platforms Layer
The Array Platforms layer is responsible for implementing support for specific array platforms, such as Affymetrix, Illumina, and others. This includes the ability to read files describing the array layout and annotations for a platform, as well as files containing data from an actual instance of a hybridized array.
The Platforms layer uses the SPI (Service Provider Interface) pattern to decouple platform implementations from the rest of the application. Each platform implementation is contained in a Guice Module, which must provide implementations of two SPI interfaces: DesignFileHandler for array design files and DataFileHandler for data files. A single platform module may contain multiple such implementations, as platforms like Affymetrix and Illumina have multiple design and data file formats for different types of arrays. Each platform implementation is packaged as an OSGi plugin, as described further in the OSGi-based Plugin System section of the Implementation View.
The rest of caArray only interacts with platform plugins via the SPI interfaces. In turn, classes in a platform plugin are provided with interfaces to the rest of caArray that limit exactly what those classes are allowed to do. This makes platform plugins more readily reusable outside of caArray.
The diagram below illustrates a single platform implementation (Genepix) and its relationship to the SPI interfaces and caArray host interfaces.
[UML Source: Design Model -> Design Elements -> Array Platforms -> Array Platforms Implementation]
Binary Data Storage Layer
The Binary Data Storage Layer is responsible for implementing support for storing and retrieving blocks of binary data. This is currently used by other parts of caArray to store raw file data and parsed array data. Note that "binary data" here means only that the data is treated as opaque byte blocks for storage and retrieval purposes; the actual data being stored in this way may in fact be text (e.g., the contents of a MAGE-TAB annotation file).
The Binary Data Storage layer uses the SPI (Service Provider Interface) pattern to decouple the mechanism for storing data from the rest of the application. The primary concept in the SPI is a storage engine, represented by the DataStorage interface. A storage engine is responsible for storing a block of data, providing access to that block as either a java.io.InputStream or a java.io.File, and removing the block of data. The underlying storage mechanism may use the local filesystem, a database, or a cloud storage service like S3. Note that there is no ability to modify the contents of an existing block of data. A block of data is identified by a URI, which consists of a scheme, identifying the storage engine, and a scheme-specific part, which the storage engine uses to identify the data.
An additional concept defined by the SPI is the StorageUnitOfWork. The materialization of a block of data as a java.io.File or a java.io.InputStream may require the creation of temporary resources, e.g. a temporary local file to hold the data, that need to be cleaned up once the application is done working with the data block. StorageUnitOfWork defines a contract for demarcating a session of working with Data Storage-managed data, and implementations can perform the necessary resource initialization and cleanup.
Each storage engine is contained in a Guice Module, which must provide an implementation of DataStorage and may optionally provide an implementation of StorageUnitOfWork. Each storage engine is packaged as an OSGi plugin, as described further in the OSGi-based Plugin System section of the Implementation View.
The rest of caArray only interacts with the Data Storage layer via the DataStorageFacade class, which takes care of forwarding requests to store or retrieve data to the appropriate storage engine based on the URI scheme.
In turn, classes in a storage engine plugin are provided with interfaces to the rest of caArray which limits exactly what these classes are allowed to do. This makes storage engine plugins more readily reusable outside of caArray.
The diagram below illustrates a single storage engine implementation (Filesystem-based) and its relationship to the SPI interfaces and facade classes.
[UML Source: Design Model -> Design Elements -> Binary Data Storage -> Binary Data Storage Implementation]
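The storage-engine SPI and URI-based routing can be sketched as follows. The method names and the toy in-memory engine are assumptions; the real DataStorage contract also covers java.io.File access, removal, and integration with StorageUnitOfWork.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Simplified storage-engine SPI: each engine owns a URI scheme and treats the
// scheme-specific part as its private block identifier.
interface DataStorage {
    String scheme();
    URI store(byte[] data);
    InputStream openInputStream(URI handle);
}

class InMemoryStorage implements DataStorage {  // toy engine, for illustration only
    private final Map<URI, byte[]> blocks = new HashMap<>();
    private int next = 0;
    public String scheme() { return "mem"; }
    public URI store(byte[] data) {
        URI handle = URI.create("mem:block-" + (next++));
        blocks.put(handle, data);
        return handle;
    }
    public InputStream openInputStream(URI handle) {
        return new ByteArrayInputStream(blocks.get(handle));
    }
}

class DataStorageFacade {
    private final Map<String, DataStorage> engines = new HashMap<>();
    void register(DataStorage engine) { engines.put(engine.scheme(), engine); }

    InputStream read(URI handle) {
        // the URI scheme selects the engine; the rest of the URI is opaque here
        return engines.get(handle.getScheme()).openInputStream(handle);
    }
}
```

Callers hold only URIs and the facade, so a block can be migrated to a different engine simply by re-storing it and updating the reference.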
Transactionality and Consistency
Some storage engines may use an underlying storage mechanism that is non-transactional, such as the local filesystem. Therefore, the Data Storage layer does not rely on JTA to ensure consistency between the main caArray database and the set of storage engines. Instead, it assumes operations on storage engines may take place immediately (though an engine may participate in the transaction), and achieves eventual consistency by having a "reaper" thread that periodically asks storage engines to remove unreferenced blocks of data. The cost is that, for a limited amount of time, the storage engines may hold unnecessary data. This is a worthwhile tradeoff, given the general inexpensiveness of storage.
There are three main aspects to this scheme:
- caArray operations that add data to the data storage subsystem will fail if a storage engine fails to add the data. The converse does not hold: the storage engine may add the data successfully while the overall operation fails. The data storage subsystem does not attempt to "roll back" the data storage additions at that point; it leaves the new data blocks in place, with no references in caArray proper. They will be cleaned up by the "reaper" thread, as explained below.
- caArray operations that remove objects which reference data in the data storage subsystem do not attempt to remove the data from the corresponding storage engine. Instead, the data blocks are left in place, potentially with no remaining references in caArray proper. They will be cleaned up by the "reaper" thread, as explained below.
- A "reaper" thread executes periodically. It examines all existing references in caArray proper to data stored in the data storage subsystem, and instructs storage engines to remove any data blocks for which no references exist. To allow for data which may have been added by transactions that have not yet committed, only data blocks that are older than the longest possible transaction length will be removed.
This algorithm ensures eventual consistency between references in caArray proper and data in the storage engines with the simplest application logic.
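The reaper's core decision can be sketched as a pure function over block ages and the set of referenced block identifiers. The names, and the use of minutes as the age unit, are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A block is removed only if it is unreferenced AND older than the longest possible
// transaction, so blocks written by still-uncommitted transactions survive the pass.
class Reaper {
    static Set<String> blocksToRemove(Map<String, Long> blockAgesMinutes,
                                      Set<String> referenced,
                                      long maxTransactionMinutes) {
        Set<String> doomed = new HashSet<>();
        for (Map.Entry<String, Long> e : blockAgesMinutes.entrySet()) {
            boolean unreferenced = !referenced.contains(e.getKey());
            boolean oldEnough = e.getValue() > maxTransactionMinutes;
            if (unreferenced && oldEnough) doomed.add(e.getKey());
        }
        return doomed;
    }
}
```

A young unreferenced block is deliberately spared: its referencing transaction may simply not have committed yet.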
Most browsers upload files using standard multipart HTTP POSTs. However, this is subject to a couple of limitations:
- The default Struts file upload handler has a maximum size limit of 2GB.
- Browsers have their own limits on the size of HTTP content, ranging from 2 to 4+ GB depending on the browser.
The uploads on the UI are handled by the jQuery File Upload plugin, which uploads each file in a separate request. Thus the limit applies per file, rather than to the combined total for a set of files being uploaded. The browsers subject to this 2GB per file constraint include:
- Firefox 3.x and earlier.
Chunked file uploads are used for browsers that support XHR file uploads and the Blob API. Any file larger than the maximum chunk size (currently set to 1.5MB) is broken into multiple chunks, which are sent separately. If the upload is interrupted, reuploading that file will simply resume from the last chunk. The system determines if the same file is being resumed by looking for a partially uploaded file with the same project, filename, and filesize. Browsers that support chunked uploads include:
- Firefox 4+
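The resumption arithmetic implied above can be sketched as follows. The 1.5MB figure matches the chunk size quoted, but the class and method names are invented for the example.

```java
// Given the number of bytes already stored for a partially uploaded file and the
// chunk size, compute which chunk to send next and how many chunks a file needs.
class ChunkedUpload {
    static final long CHUNK_SIZE = 1_572_864L;  // 1.5 MB

    static long nextChunkIndex(long bytesAlreadyStored) {
        return bytesAlreadyStored / CHUNK_SIZE;  // resume at the first incomplete chunk
    }

    static long totalChunks(long fileSize) {
        return (fileSize + CHUNK_SIZE - 1) / CHUNK_SIZE;  // ceiling division
    }
}
```

A resumed upload thus re-sends at most one partial chunk's worth of data, rather than the whole file.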
Multiple file selection for uploading from the "Add Files" button has been disabled for Safari on Windows, due to a bug in that specific browser and platform. You may still upload multiple files by dragging and dropping.
Data downloads are packaged as zip files for sizes less than 2GB. Due to limits of the zip file format, sizes larger than 2GB will be downloaded as tgz files.
Imports are limited by the server memory available and the MySQL transaction size limit of 4GB. As a result, imports of some large datasets are unable to complete within a single transaction. To resolve this issue, the SDRF file is split up and each line is imported in a separate transaction.
An executable job called ProjectFilesSplitJob was introduced. It breaks up file sets by spawning a new ProjectFilesImportJob for each data file and having them import independently. This is mostly transparent to the user, as only the parent job appears in the UI.
Upon interruption of one of these parent jobs, only the incomplete child jobs will be cancelled. To resume, the user may simply reselect the remaining unimported files.
caArray uses the standard Data Access Object pattern to provide data updates and retrievals. The DAOs are exposed as Java interfaces whose implementations are injected into the service beans using Dependency Injection. The implementations of the DAOs use Hibernate 3.2 as the underlying persistence mechanism. DAO classes are provided for the core Domain classes, but there is not a strict one-to-one domain class to DAO class correspondence.
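A minimal sketch of the pattern: service code programs against a DAO interface, and an implementation (Hibernate-backed in caArray; in-memory here, purely for illustration) is injected behind it. All names are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// The service layer sees only this interface; the persistence mechanism can be
// swapped without touching callers.
interface ExperimentDao {
    void save(long id, String title);
    Optional<String> findTitle(long id);
}

class InMemoryExperimentDao implements ExperimentDao {  // stand-in for a Hibernate DAO
    private final Map<Long, String> rows = new HashMap<>();
    public void save(long id, String title) { rows.put(id, title); }
    public Optional<String> findTitle(long id) { return Optional.ofNullable(rows.get(id)); }
}
```

The lack of a strict one-to-one domain-class-to-DAO mapping noted above means one such interface may serve several related domain classes.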
The caarraydb component represents the database used to store all of caArray's persistent data. caArray currently supports MySQL version 5.1 and uses InnoDB tables for transactional behavior. Generally, caArray strives to use only standard SQL to reduce coupling to MySQL specifically, and furthermore to funnel all interaction with the database via Hibernate. However, in certain places, specific MySQL features are used to work around performance limitations. The primary place MySQL-specific features are used is in the security filters described earlier. Thus, migrating to another RDBMS, while feasible, would entail considerable work.
caArray uses Liquibase to populate and upgrade the database. The original set of population scripts were generated directly from Hibernate annotations in the domain classes at the time. For any subsequent changes to the domain model, liquibase changesets are created and added to the changelog. This ensures that a database can be upgraded automatically from any previous state to the current state. The changesets are created manually, but are based on what Hibernate generates automatically from the annotations, to make sure the annotations and the schema are consistent.
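A hypothetical changeset illustrating the mechanism; the id, author, table, and column names below are invented for the example, not taken from the real caArray schema or changelog.

```xml
<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog">
    <!-- Illustrative changeset: adds a column after a domain model change.
         Liquibase records applied changesets, so this runs at most once per database. -->
    <changeSet id="2.5.0-example-1" author="caarray">
        <addColumn tableName="experiment">
            <column name="archive_date" type="DATETIME"/>
        </addColumn>
    </changeSet>
</databaseChangeLog>
```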
Authentication and Authorization
caArray allows experiment owners to define fine grained access constraints on both whole experiments and individual samples (and the biomaterials and data derived from those samples). Non-experiment owners (including anonymous, non-logged in users) can be given access to a small set of summary information about an experiment. Full read access to experiments and/or samples can be granted to the public as well as to defined groups of users (known as collaboration groups). The collaboration groups can also be granted write permissions to experiments and/or individual samples. Finally, an experiment can be removed from visibility entirely, making it completely inaccessible to users who have not been granted special permissions as described above.
This permissions system is implemented via integration with CSM 4.2. CSM provides a rich, fine-grained domain model for expressing security constraints, including instance and attribute level security. The concepts described above map nicely onto the classes available in CSM in a very natural way. The architecturally interesting points about the integration, described below, involve synchronization between the caArray and CSM data models and the enforcement of the security constraints defined in the model.
Synchronizing the caArray and CSM data models requires creation and modification of CSM data structures expressing appropriate security constraints in response to corresponding operations on the caArray data model. This is accomplished via SecurityInterceptor, which takes advantage of a Hibernate API that allows application code to respond to Hibernate lifecycle events. SecurityInterceptor detects creation, modification and deletion of caArray domain objects and in response creates or modifies the CSM data structures which store security constraints on those objects.
Enforcement of the security constraints is done in two ways. Hibernate filters are used to enforce read permissions and visibility control for experiments. The filters are defined for any caArray domain classes which are covered by the security model, and act as essentially additional WHERE clauses that limit any queries against those classes to instances to which the user has access. These filters are applied transparently by Hibernate, and are automatically parameterized by CSM with the current user. This provides for a clean separation of concerns, as business logic can be written without the clutter of security considerations.
To enforce write permissions, we instead use the API provided by CSM's AuthorizationManager class. The logic for doing so is centralized in the SecurityUtils class.
We also take advantage of new features introduced in CSM 4.2 to work around performance issues exhibited by previous versions of CSM. Rather than basing the filters on the generic CSM tables, special caching tables are created for each entity type that has access controls applied (experiments, samples and collaboration groups in caArray). These caching tables are optimized for subquery performance, so filters against them execute much faster. The caching tables are refreshed whenever the underlying canonical tables are modified, e.g. when new experiments are created, via the use of the SecurityInterceptor mentioned above.
It is important to note that security constraints are checked twice. First, during display of data, security constraints are checked to determine whether to display certain user interface elements. For instance, on the Work Queue page, the edit link is only displayed for an experiment if the current user has write permissions to the experiment. Second, security constraints are checked and enforced before any actual operations against protected data are performed. This ensures that the user interface shows the user only the actions they have permissions to perform, but still enforces those permissions if a malicious user circumvents the normal user interface (for instance by URL hacking).
caArray currently implements an audit log of changes made to some key entities in the data model. Auditing is implemented using AuditLogInterceptor, which (like the SecurityInterceptor described in the security section) takes advantage of the Hibernate API allowing code to be executed in response to Hibernate lifecycle events. This interceptor responds to save and update events, and, if the entity in question implements Auditable, saves a record of the changes to an audit log.
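Both interceptors follow the same shape: the persistence layer fires lifecycle events, and registered interceptors keep a secondary model (the CSM security data, the audit log) in sync. A plain-Java sketch of that shape follows, with invented names; the real code implements Hibernate's Interceptor API rather than this toy interface.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified lifecycle-callback contract; the real interceptors also see updates
// and deletes, and the audit interceptor only records Auditable entities.
interface LifecycleInterceptor {
    void onSave(Object entity);
}

class AuditInterceptor implements LifecycleInterceptor {
    final List<String> auditLog = new ArrayList<>();
    public void onSave(Object entity) {
        auditLog.add("saved " + entity.getClass().getSimpleName());
    }
}

class PersistenceSession {  // stand-in for the session firing lifecycle events
    private final List<LifecycleInterceptor> interceptors = new ArrayList<>();
    void addInterceptor(LifecycleInterceptor i) { interceptors.add(i); }
    void save(Object entity) {
        // ... persist the entity, then notify interceptors ...
        for (LifecycleInterceptor i : interceptors) i.onSave(entity);
    }
}
```

Because the callbacks ride on persistence events, business logic never has to remember to update the audit log or security tables explicitly.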
Dependency Injection and Service Locator
caArray currently uses a mix of the Dependency Injection (DI) and Service Locator patterns to manage locating implementations of needed interfaces, but with a long-term goal of replacing all Service Locator usage with DI usage. Currently, DI is used to wire DAO classes, Platform SPI classes, and general helper classes into the EJB service beans and each other. The Service Locator pattern is used to obtain instances of the EJB service beans.
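The two styles can be contrasted in a small sketch (all names invented): with a Service Locator the caller pulls an implementation out of a registry, while with DI the implementation is pushed in through the constructor.

```java
import java.util.HashMap;
import java.util.Map;

interface Greeter {
    String greet();
}

// Service Locator style: a global registry that callers query by type.
class ServiceLocator {
    private static final Map<Class<?>, Object> REGISTRY = new HashMap<>();
    static <T> void register(Class<T> type, T impl) { REGISTRY.put(type, impl); }
    static <T> T locate(Class<T> type) { return type.cast(REGISTRY.get(type)); }
}

// DI style: the dependency arrives through the constructor, wired by a container.
class Caller {
    private final Greeter greeter;
    Caller(Greeter greeter) { this.greeter = greeter; }  // injected, not looked up
    String run() { return greeter.greet(); }
}
```

DI makes a class's dependencies explicit in its constructor and easier to replace in tests, which is one reason for the stated goal of retiring the Service Locator usage.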
The major physical artifacts that comprise the caArray software deployment units, together with their relationships to the subsystems they realize, are shown below and described in the following section.
[UML Source: Deployment Model -> Architecturally Significant Implementation Elements]
[UML Source: Deployment Model -> Artifact Manifestations]
OSGi-based Plugin System
caArray uses Atlassian Plugins, an OSGi-based plugin framework, to decouple some functionality into plugins which can be deployed independently of the application proper. Currently, platform implementations (as described in the Array Platforms Layer section) and storage engines (as described in the Binary Data Storage Layer section) can be packaged as OSGi plugins.
The Atlassian Plugins framework is initialized during application startup and scans a particular directory for plugins, automatically deploying each plugin whenever it appears in that directory. Plugin removal is not currently supported, but a newer version of a plugin will be deployed automatically.
In the Architecturally Significant Implementation Elements diagram above, and in the artifact descriptions below, we show a single example each of an array platform plugin and a storage engine plugin; these should be taken as representative of the larger number of such plugins that could be present.
The caarray.ear artifact is the J2EE Enterprise Application Archive (EAR) that contains all of the web portal application and EJB components that make up the User Interface, Remote EJB API, Application Logic, Data Access and Domain Model layers of the application. The EAR also contains the third-party JARs necessary to support these layers.
The caarray.war artifact packages the JSPs, Struts 2 classes (actions, converters) and other Servlet API-based supporting classes (filters, session and application listeners, etc.) that comprise the User Interface layer of caArray. The WAR also contains the Struts 2 third-party JARs and necessary supporting JARs.
The caarray-ejb.jar packages the implementation of all of the EJB subsystems and the Remote API (Remote EJB). This includes all of the subsystems in the Application Logic Layer of the logical model. An important third-party dependency to note is AffxFusion.jar, which provides Affymetrix file format parsing support.
The caarray-common.jar contains the Domain Classes packages, the Data Access subsystem and the MageTabParser subsystem. The major third-party component dependencies noted are hibernate3.jar and hibernate-annotations.jar, which support annotation-based Hibernate ORM mapping, and csmapi.jar, which supports entity access authorization.
The caarray-client-legacy.jar contains the remote EJB interfaces required by Legacy Remote EJB API clients and Legacy Grid API clients. This JAR also repackages other third-party classes required by remote clients.
The caarray-client-external-v1_0.jar contains the remote EJB interfaces required by Service Remote EJB API clients and Service Grid API clients. This JAR also repackages other third-party classes required by remote clients. Note that this artifact would be repeated for other versions of the Service API that are supported by a caArray installation.
The CaArraySvc-service.jar contains the Legacy Grid API implementation classes.
The CaArraySvc-service.jar contains the Service Grid API implementation classes. Note that this artifact would be repeated for other versions of the Service API that are supported by a caArray installation.
The affymetrix-platform-plugin.jar contains the Affymetrix platform implementation, packaged as an OSGi plugin. The atlassian-plugin.xml descriptor file is used by the plugin system to load the platform module and hook it into the Array Platforms Layer.
The filesystem-storage-engine-plugin.jar contains the local filesystem-based storage engine implementation, packaged as an OSGi plugin. The atlassian-plugin.xml descriptor file is used by the plugin system to load the storage engine module and hook it into the Binary Data Storage Layer.
The db-changelog.xml is the master file for all database changesets and is used to drive database upgrades.
caarray-install.zip and caarray-upgrade.zip
These are installation packages enabling caArray to be installed locally from the command line, at cancer centers or other institutions. They package together the caArray web application, the caArray Grid APIs and the database changelog. caarray-install.zip is used for fresh installs, while caarray-upgrade.zip is used for updating an existing caArray installation to a new version.
A key dependency is on bda-utils.jar, which provides a set of Ant macros to help with installation tasks, such as configuring JBoss and executing database upgrades.
The caarray-gui-distribution.jar is a graphical installation package, enabling caArray to be installed locally, at cancer centers or other institutions. It is basically a graphical front-end to caarray-install.zip and caarray-upgrade.zip; it collects configuration parameters via a series of screens and then executes the appropriate command-line installer to install or update caArray.
A key dependency is on izpack.jar, which provides the graphical installer capability.
The typical deployment configuration for caArray at a cancer center is shown below. Currently, the web application and the grid service are hosted in separate application servers, both running JBoss 5.1.0. The Globus runtime is packaged inside the wsrf.war artifact deployed in JBoss 5.1.0, along with the caArray grid services. All service versions in use are hosted within the same wsrf.war.
The OSGi container used, Felix, runs within the web application JBoss instance, and is used to host the plugins.
The LDAP server is optional, and needed only if integration with a directory service for authentication is desired. A configuration option controls whether this is done. The SMTP server is required, and is used to send user registration confirmation messages.
[UML Source: Deployment Model -> Local Installer Deployment]
The CBIIT deployment of caArray is similar to the cancer center scenario, with the addition of a front-end Apache web server. The web server mediates both HTTPS requests from a web client and SOAP invocations from a Grid client, providing SSL support and firewall functionality.