NIH | National Cancer Institute | NCI Wiki  





A. Domain User Stories


  1. Search for all "pre-cancerous" biospecimens that are available for sharing at Washington University, Thomas Jefferson University, and Holden Comprehensive Cancer Center.
  2. Identify samples obtained for Glioblastoma multiforme (GBM) and the corresponding CT image information.


  3. Determine if each sample used in an expression profiling experiment is available for a SNP analysis experiment.




  4. Search for a particular gene by its Entrez Gene ID, along with its related information (e.g. messenger RNA and protein information) from GeneConnect.

5. Using inference to identify analytical steps in a pipeline


6. Automatically discover the analytical steps of an Illumina bead array analysis by applying inference to the semantic metadata of the parameters.

2nd paragraph: how KR uses inference and service specification models


7. Support patient to trial matching through the use of computable eligibility criteria

Simple extension of systems and their information models

caTissue is an application for managing a biospecimen repository. It supports searching for biospecimens across multiple, distributed repositories. The caTissue information model provides a common view of the data that is contained in these repositories. However, this information model must be considered "underspecified," since the details about biospecimens and patient encounters depend on the particular disease(s) being studied. Therefore, caTissue provides a way to extend the basic information model by defining new kinds of information that can be associated with patients, specimens, and "collection groups." Furthermore, to support searching for biospecimens across repositories, these new kinds of information must be 1) published, 2) defined in a way that others can understand, and 3) usable in searches. caTissue supports items 2 and 3, but item 1 is hampered by the current centralized nature of caDSR: information model extensions cannot be easily shared.


8. Support the addition of data elements to an existing information model and automatically capture and publish the information about the extensions.

9. When defining new datasets for caIntegrator's data-warehouse for biomedical data collection and analysis, automatically record these new datatypes in a well-defined and federated manner so that data can be shared.



B. Forms Stories

Support data collection form creation and reuse

Forms provide a convenient paper-like electronic mechanism to capture data in a structured way.  For example, when a patient is placed on a clinical trial, data about the patient's demographics and eligibility for the trial need to be captured.  Forms are often reused in their entirety or in part across organizations, such as in multi-site clinical trials.  The metadata repository must be able to support the creation and sharing of data collection forms.

Allowing form annotations to enable form behavior

Forms provide a convenient paper-like electronic mechanism to capture data in a structured way.  For example, when a patient is placed on a clinical trial, data about the patient's demographics and eligibility for the trial need to be captured.  However, forms can also exhibit specific behavior that may or may not be reusable.  This behavior includes skip patterns (if the answer to question 10 is "Yes" then skip to question 15), derived values ("what is your age?" and "is your age less than, greater than, or equal to 65?"), and composite answers ("check all" or "more than one of the above").  Furthermore, there can be specific requirements about how a form is rendered, covering, for example, the question description, help text, valid values, maximum and minimum answer length, and the format of a data mask (such as SSN).  Forms must therefore be annotatable with this behavior so that tools can render and act upon them appropriately.  Where appropriate, web- and paper-based collection instruments can also be generated automatically from this metadata.
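As an illustration of how a tool might act on such annotations, the following Python sketch models the skip pattern and the derived value from the examples above. The `Question` class, the `skip_to` annotation, and both helper functions are hypothetical names invented for this example; they are not part of any existing metadata repository API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Question:
    number: int
    text: str
    # Hypothetical annotation: maps an answer to the question number to jump to.
    skip_to: dict = field(default_factory=dict)


def next_question(question: Question, answer: str, last_number: int) -> Optional[int]:
    """Return the number of the next question to show, or None at the end of the form."""
    target = question.skip_to.get(answer, question.number + 1)
    return target if target <= last_number else None


def age_category(age: int, threshold: int = 65) -> str:
    """Derived value: classify an age relative to the threshold."""
    if age < threshold:
        return "less than"
    if age > threshold:
        return "greater than"
    return "equal to"


# Skip pattern from the text: answering "Yes" to question 10 jumps to question 15.
q10 = Question(number=10, text="(placeholder question text)", skip_to={"Yes": 15})
print(next_question(q10, "Yes", last_number=20))
print(next_question(q10, "No", last_number=20))
print(age_category(70))
```

Keeping the skip rule as data on the question, rather than as code in the rendering tool, is what would allow the same annotation to drive both a web form and a generated paper instrument.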

Extend allowable answers with additional permitted values

In many cases, data elements can be reused but the allowable values need to be extended or restricted.  For example, one researcher may want to capture diseases of the nervous system while another may want to capture diseases of the circulatory system.  Both can be captured in the same data element (disease) using the same controlled terminology (ICD-9).  However, the list of allowable values is quite different.  Furthermore, yet another researcher may want to focus only on certain circulatory diseases, such as those of the heart.  The metadata repository must allow for reuse of data elements while restricting or extending the permitted values.
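One way to picture this kind of reuse is a data element whose value set can be narrowed or widened without redefining the element. The Python sketch below assumes a hypothetical `DataElement` class; the codes are placeholders chosen for readability, not real ICD-9 codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    name: str
    terminology: str
    permitted_values: frozenset

    def restrict(self, keep):
        """Return a copy whose permitted values are narrowed to `keep`."""
        keep = frozenset(keep)
        unknown = keep - self.permitted_values
        if unknown:
            raise ValueError(f"not in the base value set: {sorted(unknown)}")
        return DataElement(self.name, self.terminology, keep)

    def extend(self, extra):
        """Return a copy whose permitted values also include `extra`."""
        return DataElement(
            self.name, self.terminology, self.permitted_values | frozenset(extra)
        )


# Placeholder codes, NOT real ICD-9 codes.
disease = DataElement(
    "disease",
    "ICD-9",
    frozenset({"NERV-1", "NERV-2", "CIRC-HEART-1", "CIRC-VEIN-1"}),
)
circulatory = disease.restrict({"CIRC-HEART-1", "CIRC-VEIN-1"})
heart_only = circulatory.restrict({"CIRC-HEART-1"})
extended = heart_only.extend({"CIRC-HEART-2"})
print(sorted(heart_only.permitted_values))
```

Each derived element keeps the same name and terminology, so data collected with any of the variants can still be recognized as the same data element.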

C. Metadata Specialist Stories

Navigation and creation of metadata through modeling and web tools

The names, semantic meaning, and linkages within and across information models provide a wealth of useful information for clinicians, informaticists, metadata specialists, and software engineers.  However, providing access to this deep and complex information in an intuitive manner can be challenging.  It is important that access be provided through modeling tools, so that metadata can be discovered, reused, and created directly in the tooling that metadata specialists and software engineers are familiar with.  Furthermore, the information models themselves should be browsable through the web in a way that hides the complexity while revealing interesting relationships.

Managing semantic relationships in order to link and share data

In many cases, different systems call the same data element by different names even though they are semantically equivalent in a given context.  For example, a hospital system may have a Patient Last Name and a clinical trials system may have a Subject Surname.  Both of these data elements share a semantic equivalence, but it may be very difficult to combine them automatically.  The metadata registry should provide a way to describe semantic relationships such as this in order to enable the linking and sharing of data.
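A minimal sketch of such a relationship, assuming that each registered data element is annotated with a shared concept identifier: two elements are then treated as equivalent when they point to the same concept. The registry contents, the concept identifiers, and the function names are all hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical registry: (system, data element name) -> shared concept identifier.
CONCEPT_OF = {
    ("hospital", "Patient Last Name"): "concept:person-family-name",
    ("clinical_trials", "Subject Surname"): "concept:person-family-name",
    ("hospital", "Patient Medical Record Number"): "concept:medical-record-number",
}


def concept_for(system: str, element: str) -> Optional[str]:
    """Look up the concept a data element is annotated with, if any."""
    return CONCEPT_OF.get((system, element))


def semantically_equivalent(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Two elements are equivalent when both map to the same known concept."""
    concept_a = concept_for(*a)
    concept_b = concept_for(*b)
    return concept_a is not None and concept_a == concept_b


print(semantically_equivalent(
    ("hospital", "Patient Last Name"),
    ("clinical_trials", "Subject Surname"),
))
```

Unregistered elements are never reported as equivalent in this sketch, which avoids linking data on the basis of missing annotations.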

Supporting interoperability standards (e.g. Healthcare Datatypes)

Leveraging interoperability standards, such as standard data formats and datatypes, is critical to data exchange within and across enterprises.  For example, ISO 21090, otherwise known as HL7 Healthcare Datatypes, provides a basic representation of common chunks of data exchanged in the healthcare community, such as Address, Document, and Coded List.  The metadata repository should be flexible enough to encode a variety of standards while restrictive enough to provide a common foundation for data exchange.  Furthermore, it is critical that organizations and individuals be able to restrict, or localize, these standards for custom use.

2nd paragraph: describe the fact that the KR can accommodate any UML-based model (including ISO 21090)

Capturing data in a standard way using data element reuse

Core to interoperability is capturing data in a standard way using the same or similar data elements.  Data elements individually can be reused, for example allowing for patient data to be joined across systems using the Patient Medical Record Number.  Forms in their entirety can be reused, such as eligibility forms for multi-site clinical trials.  Data formats for encoding biomedical data can be shared, such as MAGE-ML for gene expression data.  This allows for data to be captured in a standard way, shared across platforms and systems, for users to search based on the data that is encoded using type-ahead Google-like functionality, and for users to build new systems based on the standards that are already in use.

Support interoperable system design by finding touch points between information models

One of the key challenges in designing new systems to be interoperable with existing systems is to identify and integrate the touch points between the existing systems.  For example, in designing a system for capturing new biomedical data based on patient samples, it is important to know the key pieces of information used to link with other systems, such as biospecimen identifier, patient medical record number, bio-image identifier, etc.  The metadata repository should support the ability to discover the touch points amongst all the systems with registered metadata such that new systems can be designed in an interoperable fashion.
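Touch-point discovery can be sketched as grouping data elements by the concept they are annotated with and keeping the concepts that appear in more than one model. The model names, element names, and concept identifiers below are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical registered models: element name -> concept identifier.
MODELS = {
    "biospecimen_system": {
        "specimenId": "concept:biospecimen-identifier",
        "mrn": "concept:medical-record-number",
    },
    "imaging_system": {
        "imageId": "concept:bio-image-identifier",
        "patientMrn": "concept:medical-record-number",
    },
    "new_assay_system": {
        "sampleId": "concept:biospecimen-identifier",
        "subjectMrn": "concept:medical-record-number",
        "runDate": "concept:assay-run-date",
    },
}


def touch_points(models):
    """Map each concept used by two or more models to the (model, element) pairs carrying it."""
    by_concept = defaultdict(list)
    for model_name, elements in models.items():
        for element_name, concept in elements.items():
            by_concept[concept].append((model_name, element_name))
    return {
        concept: uses
        for concept, uses in by_concept.items()
        if len({model for model, _ in uses}) > 1
    }


for concept, uses in sorted(touch_points(MODELS).items()):
    print(concept, uses)
```

A designer of the new system could use such a listing to decide which identifiers must be captured in order to link with the systems already registered.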

Support data transformations in order to allow different tools to work together

Even semantically harmonized tooling may utilize data of different formats.  For example, in flow cytometry alone, there are a large number of standards for encoding data, such as MIFlowCyt, ACS, NetCDF, Gating-ML, FuGEFlow, and OBI.  When exchanging data between these systems, it is important to be able to describe the relationships between the standards, data elements, and value sets.  The structure of the data, the naming of the data elements, and the actual values used to encode the same data may need to be transformed in order to interoperate on the data.  On one hand, relationships between the standards can be manually described, and on the other hand, computable metadata enables automated transformation.
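To make the idea of an automated transformation concrete, the sketch below applies a declared element-name mapping and a value-set mapping to a single record. The field names, value codes, and the `transform` function are hypothetical and do not correspond to any of the standards named above.

```python
# Hypothetical mappings that computable metadata might supply.
FIELD_MAP = {"specimen_type": "sampleKind", "collected_on": "collectionDate"}
VALUE_MAP = {"sampleKind": {"WB": "whole blood", "PL": "plasma"}}


def transform(record, field_map, value_map):
    """Rename elements and translate coded values according to the declared mappings."""
    out = {}
    for source_name, value in record.items():
        target_name = field_map.get(source_name)
        if target_name is None:
            # No declared mapping: the element is dropped in this sketch.
            continue
        # Translate the value if a value-set mapping exists; otherwise keep it.
        out[target_name] = value_map.get(target_name, {}).get(value, value)
    return out


source_record = {"specimen_type": "WB", "collected_on": "2010-01-01", "note": "n/a"}
print(transform(source_record, FIELD_MAP, VALUE_MAP))
```

The sketch assumes flat records with hashable values; structural transformations between formats would need a richer mapping description than the two dictionaries used here.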

D. Developer Stories

Iterative development and management of information models

Iterative and incremental development is a cyclic software development process developed in response to the weaknesses of the waterfall model. It starts with initial planning and ends with deployment, with cyclic interactions in between.  The basic idea behind iterative enhancement is to develop a software system incrementally, allowing the developer to take advantage of what was learned during the development of earlier, incremental, deliverable versions of the system. Learning comes from both the development and the use of the system. Where possible, the key steps in the process are to start with a simple implementation of a subset of the software requirements and to iteratively enhance the evolving sequence of versions until the full system is implemented. At each iteration, design modifications are made and new functional capabilities are added.  To support an iterative development process, the metadata itself must be developed iteratively: the information model is enhanced, and semantics are added and removed, on a monthly basis.  The metadata repository must therefore support developers in creating, modifying, and removing metadata on an ongoing basis.

Support standardized processes for software development and conformance

caEHR is the flagship project applying the ECCF process, which, when applied effectively, should produce specifications that can be used to evaluate how and at what levels various information systems are interoperable. This is important for enabling coordination of IT resources across the community of NCI stakeholders. The caEHR project currently creates and manages various artifacts (CFSS, PIM, PSM) manually. Significant challenges include: 1) managing traceability and change; 2) formulating conformance assertions so that they can be evaluated; 3) collaborating on model elements (i.e. distributed model authoring).
The Knowledge Repository project should facilitate the application of the ECCF process by providing a formal model of ECCF artifacts that can be queried, for example, to determine traceability among artifacts or to generate and synchronize other artifacts.

