4 - Semantic Infrastructure Requirements

This section includes the high-level semantic requirements derived from the use cases. The semantic requirements provide a framework for the detailed description of services in the architecture section.

The following is a summary of the sub-sections:

Semantic Infrastructure Users and Roles
Functional Requirements
- Artifact Management
  - Static Models
  - Behavioral Models
  - Forms
  - Specification Content
- Services Lifecycle Management & Governance
  - Discovery
  - Lifecycle Management
  - Governance
- Case Report Form Modeling
  - Form template authoring
- Conformance Testing
- P/S/T & Terminology Integration

Requirements Analysis

This section presents the requirements derived from analysis of the use cases presented in the previous section. The analysis traces each requirement up to the use cases and stakeholders and down to the service capabilities specified later in this document.

Semantic Infrastructure Consumers and Roles

The semantic infrastructure is expected to address the needs of a broad group of stakeholders. The semantic infrastructure as defined in this section provides foundational specifications and capabilities for the following key users:

Clinicians
Model Developers
Service Developers
Service Architects
Service Analysts
CBIIT Enterprise Architecture Governance
Vendors
Platforms, including caGrid 2.0
BioInformatics Specialists

Functional Requirements

This section includes:

Artifact Management
Service Lifecycle Management and Governance
CRF Modeling
Conformance Testing
P/S/T & Terminology Integration

The requirements listed above address one or more use cases in each domain. In addition to the domain-specific use cases, the requirements also address CBIIT's internal development and architecture requirements. Specifically, CBIIT has standardized on Service-Oriented Architecture as the foundational principle for application architecture and interoperability. CBIIT has also adopted a formal approach to defining service specifications, supporting both interoperability and the need to publish formal specifications that can be adopted by external organizations and vendors.

The following sections provide detail on these categories of requirements, defining the requirement as well as describing the relevance to our primary and secondary use-cases.

Artifact Management

Artifact management includes support for models in different formats, both static and dynamic, as well as the ability to manage content and clinical documents. A service specification is made up of service metadata, artifacts, and the metadata supporting these artifacts. Artifact management primarily deals with managing the artifact lifecycle and authoring artifact metadata.

Static models include (but are not limited to):

XML Schemas
UML/HL7 Models
OWL
Meta Models
Transforms
Model Constraints
Data Types

Dynamic models include (but are not limited to):

HL7 SAIF behavioral model
Orchestrations & Workflows
Rules - Drools, etc.

Content

Service specification content, primarily unstructured text
Images and other representations of static content

Forms

Form Templates
Form Definitions

Artifact lifecycle management and metadata provide the ability to:

Manage lifecycle/governance/versioning of the models, content and forms.
Establish relationships and dependencies between models, content and forms
Provenance, jurisdiction, authority and intellectual property
Representation and views of the information, realized through the appropriate transforms
Access control and other security constraints
Annotations
Usage
Representations
Terminology and Value Set binding

The artifacts are bound to the services via the service metadata. The service metadata, combined with the artifacts and their supporting metadata, provides a comprehensive service specification.
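
The composition described above (service metadata, artifacts, and per-artifact metadata) can be pictured as a small data structure. The following is a minimal sketch only; the class names, field names, and the example artifact are hypothetical and are not taken from any CBIIT specification.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ArtifactMetadata:
    # Hypothetical fields loosely based on the lifecycle and metadata list above.
    version: str
    lifecycle_state: str  # e.g. "draft", "approved", "retired"
    provenance: str       # who or what produced the artifact
    depends_on: list[str] = field(default_factory=list)  # names of other artifacts


@dataclass
class Artifact:
    name: str
    kind: str  # "static model", "dynamic model", "content", or "form"
    metadata: ArtifactMetadata


@dataclass
class ServiceSpecification:
    service_metadata: dict[str, str]
    artifacts: list[Artifact] = field(default_factory=list)

    def unresolved_dependencies(self) -> set[str]:
        """Return dependency names that no artifact in this specification provides."""
        provided = {a.name for a in self.artifacts}
        needed = {d for a in self.artifacts for d in a.metadata.depends_on}
        return needed - provided


# Illustrative data only: the service and artifact names are made up.
spec = ServiceSpecification(
    service_metadata={"name": "ImageAnnotationService", "version": "1.0"},
    artifacts=[
        Artifact(
            "annotation.xsd",
            "static model",
            ArtifactMetadata("1.0", "approved", "model developer",
                             depends_on=["datatypes.xsd"]),
        ),
    ],
)
```

Here `spec.unresolved_dependencies()` reports `datatypes.xsd`, since the example specification references that artifact without including it. A check of this kind is one way the relationship and dependency metadata might be used.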

The service specification describes all aspects of a service. This is a foundational requirements group that allows a service developer to build higher-level services (for example, a service in the NCI Enterprise Service Inventory) that are utilized in the use case. This group of requirements allows a primary user (service or application developer) to provide business logic as a service to the broader enterprise. It focuses on specification-driven configuration of policies, security requirements, and metadata of the service, and on a development process that allows service developers to focus on business logic.

Link to use case: service-based capabilities for image databasing, image annotation creation, search and query, patient electronic medical record markup, glioblastoma recommended treatment analysis, and other capabilities in the use case.

Use Cases Addressed

caEHR

Service Lifecycle Management and Governance

Service deployment requirements include instance-specific configuration of service policies, business logic configuration, security configuration, and local or remote (for example, cloud) deployment. The requirements also include configuration of instance-specific service metadata, advertisement, and publication of a service to the broader ecosystem for re-use.

Link to use case: the services may be located at an institution or hosted externally by service providers.

CRF Modeling

Discovery includes service discovery, data discovery, and policy discovery. Service discovery allows primary users as well as secondary users to locate a service specification and instances based on attributes in the service metadata (for example, via a search for specific micro-array analysis services). Data discovery enables secondary users to find the types of data available in the ecosystem as well as summary-level information about available data sets. Policy discovery allows application developers to find and retrieve policies on services.

Link to use case: As institutions share de-identified glioblastoma data sets, they are available to others via data discovery. The treatment recommendation service used by the oncologist is able to discover these new data sets and their corresponding information models, and include that data for subsequent use in recommendation of treatment.
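
Service discovery by metadata attributes, as described above, can be illustrated with a simple attribute match over a registry. This is a sketch under stated assumptions: the registry is an in-memory list, and the `discover` function, the attribute names (`category`, `kind`), and the entries are all hypothetical.

```python
def discover(registry, **criteria):
    """Return registry entries whose metadata matches every given attribute exactly."""
    return [
        entry for entry in registry
        if all(entry.get(key) == value for key, value in criteria.items())
    ]


# Illustrative registry holding both specifications and deployed instances.
registry = [
    {"name": "MicroarrayClustering", "category": "micro-array analysis", "kind": "specification"},
    {"name": "MicroarrayClustering@siteA", "category": "micro-array analysis", "kind": "instance"},
    {"name": "ImageAnnotation", "category": "imaging", "kind": "specification"},
]

# Find running instances of micro-array analysis services.
matches = discover(registry, category="micro-array analysis", kind="instance")
```

A real registry would likely support richer queries (ranges, free text, semantic matching); exact attribute equality is used here only to keep the example short.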

Conformance Testing

Human semantics include metadata about a service that is meant to be displayed via a user interface, for example, a description of the operations defined on a service. Computable semantics are metadata that are added to a service primarily in order to facilitate service orchestration and choreography, and to specify precisely the semantic meaning of data in order to allow interpretation and reasoning. Services in the ecosystem must have both types of metadata in order to facilitate tools for the platform and enable working interoperability.

Link to use case: Image analysis services will need to adequately describe the actions they perform, the required input, and the expected output, so that a human or a computer may discover appropriate analysis algorithms to be used on an image.

Service Utilization

This group of requirements focuses on enabling developers of composite services and applications to discover, compose, and invoke services. This includes the discovery of published services based on service metadata and the generation of client APIs in multiple languages to provide cross-platform access to existing services. This also includes the ability to use an "analytical" service locally in the case where the data to be processed is too large to move to a remote service.

Link to use case: all of the data management and access services in the use case are utilized by application developers to build the user interfaces that the clinicians use during the course of patient care.

Service Orchestration and Choreography

Service orchestration and choreography allows both application developers and non-developers to discover service "building blocks" that can be composed dynamically to provide business capabilities. Special cases include the orchestration of multiple services for a distributed query, or for a transactional workflow. Service orchestration and choreography will leverage static and behavioral semantics from the Semantic Infrastructure v2.

Link to use case: Federated query over the TCGA data and other data sets is performed using a service orchestration.

Policy and Rules Management

Policy and Rules Management allows non-developer secondary users to create policies and rules and apply them to services. The scope of policies includes, but is not limited to, definition and configuration of business processing policy and related rules, compliance policies, quality of service policies, and security policies. Key functional requirements for managing policies include capabilities to author, store, approve, and validate policies, and to execute them at run time.

Link to use case: Each institution has different data sharing needs, access control needs, and business rules for processing that are defined and customized. For example, policy at the pathologist's institution may state that the patient is scheduled for a visit when the review is complete.
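
The scheduling example above can be sketched as a policy that pairs a condition with an action and only takes effect once approved. The `Policy` class, the `applicable_actions` function, and the second policy are hypothetical and serve only to illustrate the approval and run-time execution steps.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    condition: Callable[[dict], bool]  # evaluated against run-time context
    action: str                        # what to do when the condition holds
    approved: bool = False             # unapproved policies are never executed


def applicable_actions(policies, context):
    """Return the actions of approved policies whose condition holds for the context."""
    return [p.action for p in policies if p.approved and p.condition(context)]


policies = [
    Policy(
        "schedule-after-review",
        lambda ctx: ctx.get("review_complete", False),
        "schedule patient visit",
        approved=True,
    ),
    # Hypothetical draft policy: authored and stored, but not yet approved.
    Policy("draft-policy", lambda ctx: True, "notify billing"),
]

actions = applicable_actions(policies, {"review_complete": True})
```

With the review marked complete, only the approved policy contributes an action; the draft policy is skipped until it passes approval.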

Event Processing and Notifications

Event Processing and Notifications enables monitoring of services in the ecosystem and provides for asynchronous updates by services, effectively allowing a loose coordination of services that both provide and respond to conditions (possibly defined in business rules).

Link to use case: As patient care proceeds, the system notifies the designated clinicians that data (for example, images) are ready for review. Similarly, when notifications are received, event processing logic allows the appropriate parties to assign clinicians for care. In order to facilitate better treatment (a learning healthcare system), as new de-identified glioblastoma data is made available, notifications are sent that could indicate a recommended change in the treatment plan.
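
The loose coordination described above is commonly realized with a publish/subscribe pattern. The sketch below is a synchronous, in-process stand-in for what would be asynchronous delivery in a real deployment; the `EventBus` class, the `images.ready` event name, and the study identifier are all hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe hub: services publish events, others react to them."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Deliver the payload to each handler and return how many were notified."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


inbox = []  # stands in for a clinician's notification queue
bus = EventBus()
bus.subscribe(
    "images.ready",
    lambda payload: inbox.append(f"Review requested for {payload['study']}"),
)
delivered = bus.publish("images.ready", {"study": "study-001"})
```

The publisher does not need to know who is listening, which is what allows services to be added or removed without changing the services that emit events.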

Data Requirements

This section includes the following requirements:

Data representation and information models
Data management
Data exploration and query
High-throughput data
Provenance
Data semantics

Data Representation and Information Models

This set of requirements includes providing an application developer with the ability to define application-specific data elements and attributes (for example, defined using ISO 21090 healthcare datatypes) and an information model that defines the relationships between these data elements and attributes and other data elements and attributes in the broader ecosystem. In particular, the last requirement suggests linked datasets, where application developers can connect data in disparate repositories as if the repositories are part of a larger federated data ecosystem. Additional requirements include the ability to publish and discover information models. Support is needed for forms data and common clinical document standards, such as HL7 CDA. To support the use of binary data throughout the system, the binary data must be typed and semantically annotated.

Link to use case: The pathology, radiology and other data have various data formats which must be described, and the information model for the patient record must link between these various datatypes. The complete information model includes semantic links between datasets to build a comprehensive electronic medical record. Annotations on data are defined and included in the information model.

Data Management

Data management includes linking of disparate data sets and updates of data across the ecosystem. Data updates may span multiple data sources, necessitating transactions.

Link to use case: the patient has an electronic medical record that spans multiple institutions. The clinical workup data (for example, genomics and proteomics data) is linked to the clinical care record; similarly pathology and radiology findings must be attached to the patient's electronic medical record.

Data Exploration and Query

The wealth of data must be accessible, resulting in the need for exploration of available datasets. This includes the ability to view seamlessly across independent data sets, allowing a secondary user to integrate data from multiple sources. In addition, the query capability must support sophisticated queries such as temporal queries and spatial queries.

Link to use case: The oncologist must be able to quickly find glioblastoma data sets, indicating the fields that he is interested in comparing from his clinical data in order to find similar disease conditions and associated treatment plans. Temporal queries allow clinicians to identify changes in patient condition and treatment over time.
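
A temporal query of the kind mentioned above can be as simple as selecting observations within a date window and ordering them by time. The sketch below assumes in-memory records with a `date` field; the function name, the field names, and the measurements are made up for illustration and are not clinical data.

```python
from datetime import date


def observations_between(records, start, end):
    """Return records dated within [start, end], oldest first."""
    selected = [r for r in records if start <= r["date"] <= end]
    return sorted(selected, key=lambda r: r["date"])


# Illustrative values only.
records = [
    {"date": date(2011, 3, 1), "tumor_size_mm": 31},
    {"date": date(2011, 1, 10), "tumor_size_mm": 34},
    {"date": date(2010, 11, 5), "tumor_size_mm": 28},
]

window = observations_between(records, date(2011, 1, 1), date(2011, 12, 31))
```

Ordering the result by date is what lets a clinician read the selected observations as a progression over time.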

High-Throughput Data

An extremely important data requirement is to store and access emerging large data sets (for example, next-generation sequencing data). The key non-functional requirements in this area are efficient storage of and access to enormous amounts of data, potentially via streaming, and potentially performing computation at the location where the data is stored if the volume of data is too large to be transferred. As much of this data is binary, a standards-based approach to binary data transfer is also required.

Link to use case: High-resolution digital images must be transferred to other sites during review.

Provenance

Provenance encompasses the origin and traceability of data throughout an ecosystem. This is a clear requirement directly from the use case in order to ensure that all steps of patient care and research are clearly linked via the patient record.

Link to use case: The origin of data is tied to the data creator, allowing the oncologist performing the match against TCGA data and other datasets to include and exclude data sets based on their origin.
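
Including and excluding data sets by origin, as in the use case above, presupposes that each data set carries an origin attribute. The following sketch assumes a simple `origin` field; the function name and the data set names are hypothetical.

```python
def filter_by_origin(datasets, include=None, exclude=()):
    """Keep datasets whose origin is allowed and not explicitly excluded."""
    excluded = set(exclude)
    allowed = None if include is None else set(include)
    return [
        d for d in datasets
        if d["origin"] not in excluded
        and (allowed is None or d["origin"] in allowed)
    ]


# Illustrative data sets with their recorded origin.
datasets = [
    {"name": "tcga-gbm", "origin": "TCGA"},
    {"name": "site-a-gbm", "origin": "Institution A"},
    {"name": "site-b-gbm", "origin": "Institution B"},
]

trusted = filter_by_origin(datasets, exclude=["Institution B"])
```

Full provenance would also record the chain of transformations applied to the data; only the originating source is modeled here.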

Data Semantics

In a diverse information environment, semantics must be used to clearly indicate the meaning of data. This requirement is expected to be addressed by the semantic infrastructure, although there will be a touchpoint between caGrid 2.0 and the semantic infrastructure to annotate data with semantics. Integration with the semantic infrastructure will enable reasoning, semantic query, data mediation (for example, ad hoc data transformation), and other powerful capabilities.

Link to use case: The oncologist accesses the TCGA database to search for de-identified glioblastoma tumor data that is similar to the patient data exported from the hospital medical record. During this search, the semantics of the data fields are leveraged to indicate matches between TCGA data fields and the hospital medical record data fields.
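
One way to picture the field matching described above: if fields in both sources are annotated with shared concept identifiers, fields can be paired by concept rather than by name. This is a sketch under that assumption; the field names and the `concept:` identifiers are placeholders, not real terminology codes.

```python
def match_fields(source_annotations, target_annotations):
    """Pair source fields with target fields annotated with the same concept."""
    by_concept = {}
    for field_name, concept in target_annotations.items():
        by_concept.setdefault(concept, []).append(field_name)
    return {
        field_name: by_concept[concept]
        for field_name, concept in source_annotations.items()
        if concept in by_concept
    }


# Placeholder annotations: field name -> concept identifier.
hospital = {"dx_code": "concept:diagnosis", "pt_age": "concept:age", "room": "concept:room"}
tcga = {"diagnosis": "concept:diagnosis", "age_at_diagnosis": "concept:age"}

matches = match_fields(hospital, tcga)
```

The hospital field `room` has no counterpart and is left out of the result. Exact concept equality is the simplest case; reasoning over concept hierarchies would allow broader matches.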

External Data Repositories

There are numerous data repositories on the web today. These data repositories contain essential information that must be accessible to services in the ecosystem. As a result, caGrid 2.0 must provide capabilities to integrate these external repositories into the Grid with the assumption that the remote service cannot be changed.

Link to use case: The oncologist searches both TCGA glioblastoma data and de-identified data that has been added by care providers around the country. The additional data sets are external data repositories.