A. Domain User Stories

Enabling concept-driven, cross-domain searching

Ad hoc searching and reporting tools are critically important for enabling new discovery. A clinician sits down at a biomedical informatics tool to perform the following searches (taken from the requirements elicitation):

  • Search for all pre-cancerous biospecimens from caTissue instances like those at Washington University, Thomas Jefferson University, and Holden Comprehensive Cancer Center. 
  • Identify samples obtained for Glioblastoma multiforme (GBM) and the corresponding CT image information. This query can be performed by querying across caTissue and NBIA using caB2B. 
  • Determine if each sample used in an expression profiling experiment is available for a SNP analysis experiment. This query can be performed by querying across caTissue and caArray using caB2B. 
  • Search for a particular gene by its Entrez Gene ID and retrieve its related information, e.g., messenger RNA and protein information, from GeneConnect.

An important aspect of these searches is the ability to construct the search using well-defined data elements and semantics, apply it to federated repositories, and aggregate the results in a way that does not bog the user down in the technical details of the information space and its technologies, as in the sketch below.
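
The following minimal Python sketch illustrates the intent; the Criterion class, the repository adapter objects, and their execute method are hypothetical stand-ins, not the actual caB2B API. A query is built from registered data elements, fanned out to federated repositories, and the results pooled.

    from dataclasses import dataclass

    @dataclass
    class Criterion:
        data_element: str   # a registered common data element
        operator: str
        value: str

    def federated_search(criteria, repositories):
        """Fan a concept-level query out to each repository and pool the rows.

        Each repository adapter translates the shared data elements into its
        local schema, so the user never sees per-site technical details.
        """
        results = []
        for repo in repositories:
            results.extend(repo.execute(criteria))
        return results

    # Hypothetical usage: pre-cancerous specimens across caTissue sites.
    criteria = [Criterion("Specimen.pathologicalStatus", "=", "Pre-Malignant")]
    # 'repositories' would hold adapter objects for the Washington University,
    # Thomas Jefferson University, and Holden Comprehensive Cancer Center
    # caTissue instances.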

Using inference to identify analytical steps in a pipeline

A common practice in bioinformatics research is to build pipelines of data transformation and analytical steps. For example, an Illumina bead array analysis may involve data input, quality control, BeadArray-specific variance stabilization, normalization, and gene annotation at the probe level. The different steps may involve different data inputs. It is possible to automatically discover these analytical steps using inference based on the semantic metadata of their parameters.

The Knowledge Repository supports this by combining inference with its service specification models: each registered analytical service declares the semantic types of its inputs and outputs, so a reasoner can chain together services whose output semantics satisfy the input semantics of a downstream step.
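
As a rough illustration, assume each service's registered metadata records a single semantic input and output type (the step names and type names below are hypothetical); a pipeline can then be inferred by chaining matches:

    # Hypothetical step descriptions: each analytical service advertises the
    # semantic types of its inputs and outputs in its registered metadata.
    steps = {
        "quality_control":        {"in": "RawBeadIntensities",    "out": "QCBeadIntensities"},
        "variance_stabilization": {"in": "QCBeadIntensities",     "out": "StabilizedIntensities"},
        "normalization":          {"in": "StabilizedIntensities", "out": "NormalizedExpression"},
        "probe_annotation":       {"in": "NormalizedExpression",  "out": "AnnotatedExpression"},
    }

    def infer_pipeline(start_type, goal_type):
        """Chain steps by matching output semantics to input semantics."""
        pipeline, current = [], start_type
        while current != goal_type:
            step = next((name for name, s in steps.items() if s["in"] == current), None)
            if step is None:
                raise ValueError(f"no step accepts {current}")
            pipeline.append(step)
            current = steps[step]["out"]
        return pipeline

    print(infer_pipeline("RawBeadIntensities", "AnnotatedExpression"))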

Supporting trial matching through the use of computable eligibility criteria

Matching patients to trials is crucial to NCI's mission to reduce suffering due to cancer. Patient-oriented applications (e.g., BreastCancerTrials.org) enable patients to discover trials that are appropriate for them. Protocol-oriented applications enable clinical researchers and pharmaceutical companies to identify and recruit appropriate patient cohorts. Inadequate study-matching support prevents patients from receiving promising therapies and slows advances in treatment. Challenges to study matching include both the evaluation of eligibility criteria over heterogeneous patient information models (e.g., the EMRs found in various hospitals) and the coordination of eligibility criteria authoring across studies (e.g., to improve study design so as to maximize accrual).

The Knowledge Repository project should provide capabilities to describe semantic relationships among information models (e.g., between a hospital's EMR and the BRIDG model), so as to allow a matching application to rewrite eligibility criteria that are specified in terms of one information model (e.g., BRIDG) into a form that can be evaluated over another information model (e.g., a hospital's EMR). It should also enable modeling and sharing of eligibility criteria so that they can be discovered and reused. The Clinical Trials Reporting Program (CTRP), among several other projects, is developing tooling to support authoring of structured eligibility criteria; those applications should be able to discover criteria in, and register criteria with, the Knowledge Repository. The caEHR project will eventually need to enable BRIDG-based queries of patient information (e.g., History & Physical) over the information models of existing EMRs. Those components should be able to leverage the semantic model relationships described in a Knowledge Repository instance.
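
A minimal sketch of the rewriting idea, assuming the repository stores path-to-path mappings between models; the MODEL_MAP structure and the attribute paths below are hypothetical, not actual BRIDG or EMR schemas:

    # Hypothetical mapping between a BRIDG path and a hospital EMR path;
    # a real Knowledge Repository would store these relationships as metadata.
    MODEL_MAP = {
        ("BRIDG", "PerformedObservationResult.value"): ("HospitalEMR", "lab_result.value"),
        ("BRIDG", "StudySubject.birthDate"):           ("HospitalEMR", "patient.dob"),
    }

    def rewrite_criterion(criterion, source_model, target_model):
        """Rewrite an eligibility criterion from one information model to another."""
        path, op, value = criterion
        mapped = MODEL_MAP.get((source_model, path))
        if mapped is None or mapped[0] != target_model:
            raise KeyError(f"no mapping from {source_model}:{path} to {target_model}")
        return (mapped[1], op, value)

    # An age criterion expressed against BRIDG as a birth-date cutoff,
    # rewritten for evaluation against the EMR schema:
    print(rewrite_criterion(("StudySubject.birthDate", "<=", "1992-01-01"),
                            "BRIDG", "HospitalEMR"))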

Simple extension of systems and their information models

caTissue is an application for managing a biospecimen repository. It supports searching for biospecimens across multiple, distributed repositories, and the caTissue information model provides a common view of the data contained in these repositories. However, this information model must be considered "underspecified," since details about biospecimens and patient encounters depend on the particular disease(s) being studied. Therefore, caTissue provides a way to extend the basic information model by defining new kinds of information that can be associated with patients, specimens, and "collection groups." To support searching for biospecimens across repositories, these new kinds of information must be 1) published, 2) defined in a way that others can understand, and 3) usable in searches. caTissue supports items 2 and 3, but item 1 is hampered by the currently centralized nature of caDSR: information model extensions cannot be easily shared.

caIntegrator provides a data warehouse for biomedical data collection and analysis. These data are often characterized by high dimensionality and/or large size. New datasets can easily be added by defining the variables collected for a new study. It is important that these new datatypes be described in a well-defined and federated manner so that the data can be shared.

The Knowledge Repository will support this type of functionality by providing distributed, federated model registry services. For example, a consortium of biospecimen repositories will be able to publish caTissue information model extensions to a shared model registry. The extensions in that registry will refer to the common caTissue information model that is published to the NCI model registry.
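
A sketch of what publishing an extension to such a registry might look like; the classes, the attribute code, and the registry URI below are all hypothetical:

    from dataclasses import dataclass, field

    @dataclass
    class ModelExtension:
        name: str
        base_model: str           # URI of the common model this extends
        attributes: dict = field(default_factory=dict)

    @dataclass
    class ModelRegistry:
        models: dict = field(default_factory=dict)

        def publish(self, ext):
            self.models[ext.name] = ext

    # A consortium registry holding a disease-specific caTissue extension that
    # points back to the common model in a (hypothetical) NCI registry URI.
    consortium = ModelRegistry()
    consortium.publish(ModelExtension(
        name="GBMSpecimenAnnotation",
        base_model="https://registry.nci.example/models/caTissue",
        attributes={"tumorGrade": "CD:who_cns_grade"},
    ))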

B. Forms Stories

Support data collection form creation and reuse

Forms provide a convenient paper-like electronic mechanism to capture data in a structured way.  For example, when a patient is placed on a clinical trial, data about the patient's demographics and eligibility for the trial need to be captured.  Forms are often reused in their entirety or in part across organizations, such as in multi-site clinical trials.  The metadata repository must be able to support the creation and sharing of data collection forms.

Allowing form annotations to enable form behavior

Forms can also exhibit specific behavior that may or may not be reusable. Examples include skip patterns (if the answer to question 10 is "Yes," skip to question 15), derived values ("What is your age?" and "Is your age less than, greater than, or equal to 65?"), and composite answers ("check all that apply" or "more than one of the above"). Furthermore, specific requirements about how a form is rendered may exist, such as the question description, help text, valid values, maximum and minimum answer length, and the format of a data mask (such as an SSN). It is important to allow forms to be annotated with this behavior so that tools can appropriately render and act upon them. Where appropriate, web- and paper-based collection instruments can then be generated automatically from this metadata.
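
As an illustration, behavior such as a skip pattern could be recorded as annotations that a rendering tool then acts upon. The metadata structure and question identifiers below are a hypothetical sketch, not an actual caDSR form model:

    # Hypothetical form metadata: behavior (skip pattern, derived value) and
    # rendering hints (data mask, length limits) recorded per question.
    form = {
        "questions": [
            {"id": "q10", "text": "Have you ever smoked?",
             "valid_values": ["Yes", "No"],
             "skip": {"when": "Yes", "go_to": "q15"}},
            {"id": "q11", "text": "What is your age?", "datatype": "integer",
             "derive": {"id": "q11a", "rule": "age >= 65"}},
            {"id": "q12", "text": "Social Security Number",
             "mask": "###-##-####", "max_length": 11},
        ]
    }

    def next_question(form, answered_id, answer):
        """Apply a skip-pattern annotation when rendering the form."""
        ids = [q["id"] for q in form["questions"]]
        q = form["questions"][ids.index(answered_id)]
        skip = q.get("skip")
        if skip and answer == skip["when"]:
            return skip["go_to"]
        i = ids.index(answered_id) + 1
        return ids[i] if i < len(ids) else None

    print(next_question(form, "q10", "Yes"))   # -> "q15"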

Extend allowable answers with additional permitted values

In many cases, data elements can be reused but the allowable values need to be extended or restricted. For example, one researcher may want to capture diseases of the nervous system while another may want to capture diseases of the circulatory system. Both can be captured in the same data element (disease) using the same controlled terminology (ICD-9); however, the lists of allowable values are quite different. Furthermore, yet another researcher may want to focus only on certain circulatory diseases, such as those of the heart. The metadata repository must allow for the reuse of data elements while restricting or extending the permitted values.
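
A small sketch of the idea; the ICD-9 code ranges and helper functions below are illustrative only:

    # One data element (disease, coded in ICD-9) reused with different
    # permitted-value sets. The code ranges below are illustrative.
    nervous_system = {str(c) for c in range(320, 390)}   # roughly 320-389
    circulatory    = {str(c) for c in range(390, 460)}   # roughly 390-459
    heart_only     = {str(c) for c in range(410, 430)}   # a narrower subset

    def restrict(permitted, subset):
        """Restrict a value domain; the subset must already be permitted."""
        if not subset <= permitted:
            raise ValueError("restriction introduces values outside the domain")
        return subset

    def extend(permitted, extra):
        """Extend a value domain with additional permitted values."""
        return permitted | extra

    cardiac_domain  = restrict(circulatory, heart_only)
    combined_domain = extend(nervous_system, circulatory)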

C. Metadata Specialist Stories

Navigation and creation of metadata through modeling and web tools

This information, including the names, semantic meaning, and linkages within and across information models, provides a wealth of useful information for clinicians, informaticists, metadata specialists, and software engineers. However, providing access to this deep and complex information in an intuitive manner can be challenging. It is important that access be provided through modeling tools, so that metadata can be discovered, reused, and created directly in the tooling that metadata specialists and software engineers are already familiar with. Furthermore, the information models themselves should be browsable through the web in a way that hides the complexity while revealing interesting relationships.

Managing semantic relationships in order to link and share data

In many cases, different systems call the same data element by different names even though they are semantically equivalent in a given context.  For example, a hospital system may have a Patient Last Name and a clinical trials system may have a Subject Surname.  Both of these data elements share a semantic equivalence, but it may be very difficult to combine them automatically.  The metadata registry should provide a way to describe semantic relationships such as this in order to enable the linking and sharing of data.
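
For illustration, if the registry recorded that both local names denote the same concept, a tool could join the data automatically. The EQUIVALENT table, concept identifier, and field names below are hypothetical:

    # Hypothetical equivalence assertions: the registry records that two
    # locally named data elements share a concept, enabling an automated join.
    EQUIVALENT = {("hospital", "patient_last_name"): "concept:PersonSurname",
                  ("trials",   "subject_surname"):   "concept:PersonSurname"}

    def join_on_concept(rows_a, system_a, rows_b, system_b, concept):
        """Join two datasets on fields asserted to share a concept."""
        key_a = next(f for (s, f), c in EQUIVALENT.items()
                     if s == system_a and c == concept)
        key_b = next(f for (s, f), c in EQUIVALENT.items()
                     if s == system_b and c == concept)
        index = {r[key_b]: r for r in rows_b}
        return [(r, index[r[key_a]]) for r in rows_a if r[key_a] in index]

    hospital = [{"patient_last_name": "Rivera", "mrn": "001"}]
    trials   = [{"subject_surname": "Rivera", "study": "S-17"}]
    print(join_on_concept(hospital, "hospital", trials, "trials",
                          "concept:PersonSurname"))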

Supporting interoperability standards (e.g. Healthcare Datatypes)

Leveraging interoperability standards, such as standard data formats and datatypes, is critical to data exchange within and across enterprises. For example, ISO 21090, otherwise known as the HL7 Healthcare Datatypes, provides a basic representation of common chunks of data exchanged in the healthcare community, such as Address, Document, and Coded List. The metadata repository should be flexible enough to encode a variety of standards while restrictive enough to provide a common foundation for data exchange. Furthermore, it is critical that organizations and individuals be able to restrict, or localize, these standards for custom use.

Because the Knowledge Repository can accommodate any UML-based model, standards such as ISO 21090 can be registered directly, and localizations can be expressed as constraints on the registered model.
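
As a rough sketch, a localization can be expressed as an additional constraint checked against a registered datatype. The AD class below is a greatly simplified stand-in for the real ISO 21090 Address datatype (which has address parts, use codes, and validity periods), and the US ZIP rule is a hypothetical localization:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class AD:
        """Simplified stand-in for the ISO 21090 AD (Address) datatype."""
        street: Optional[str] = None
        city: Optional[str] = None
        postal_code: Optional[str] = None
        country: Optional[str] = None

    def conforms_to_us_localization(addr: AD) -> bool:
        """A hypothetical localization: US deployments require a 5-digit ZIP."""
        return (addr.postal_code is not None
                and len(addr.postal_code) == 5
                and addr.postal_code.isdigit())

    print(conforms_to_us_localization(AD(city="Bethesda", postal_code="20892")))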

Capturing data in a standard way using data element reuse

Core to interoperability is capturing data in a standard way using the same or similar data elements. Individual data elements can be reused, for example allowing patient data to be joined across systems using the Patient Medical Record Number. Forms can be reused in their entirety, such as eligibility forms for multi-site clinical trials. Data formats for encoding biomedical data can be shared, such as MAGE-ML for gene expression data. This allows data to be captured in a standard way and shared across platforms and systems, lets users search the encoded data using type-ahead, Google-like functionality, and lets users build new systems based on the standards already in use.
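
The type-ahead search mentioned above might, at its simplest, be a prefix lookup over registered data element names; the element names below are illustrative:

    import bisect

    # Hypothetical registered data element names, kept sorted for prefix lookup.
    ELEMENTS = sorted([
        "Patient Medical Record Number",
        "Patient Last Name",
        "Biospecimen Identifier",
        "Bio-Image Identifier",
    ])

    def type_ahead(prefix, limit=5):
        """Return registered data elements whose names start with the prefix."""
        lo = bisect.bisect_left(ELEMENTS, prefix)
        hits = []
        for name in ELEMENTS[lo:]:
            if not name.startswith(prefix):
                break
            hits.append(name)
            if len(hits) == limit:
                break
        return hits

    print(type_ahead("Patient"))   # both "Patient ..." elements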

Support interoperable system design by finding touch points between information models

One of the key challenges in designing new systems to be interoperable with existing systems is to identify and integrate the touch points between the existing systems.  For example, in designing a system for capturing new biomedical data based on patient samples, it is important to know the key pieces of information used to link with other systems, such as biospecimen identifier, patient medical record number, bio-image identifier, etc.  The metadata repository should support the ability to discover the touch points amongst all the systems with registered metadata such that new systems can be designed in an interoperable fashion.
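
A minimal sketch of touch-point discovery over registered metadata; the system names and element sets below are hypothetical:

    # Hypothetical registered models mapped to shared data element concepts;
    # touch points are the elements two systems have in common.
    REGISTERED = {
        "caTissue": {"BiospecimenIdentifier", "PatientMedicalRecordNumber"},
        "NBIA":     {"BioImageIdentifier", "PatientMedicalRecordNumber"},
        "NewApp":   {"BiospecimenIdentifier", "BioImageIdentifier",
                     "PatientMedicalRecordNumber"},
    }

    def touch_points(system, registry=REGISTERED):
        """For each existing system, list the data elements shared with it."""
        mine = registry[system]
        return {other: sorted(mine & elems)
                for other, elems in registry.items() if other != system}

    print(touch_points("NewApp"))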

Support data transformations in order to allow different tools to work together

Even semantically harmonized tooling may utilize data of different formats. For example, in flow cytometry alone, there are a large number of standards for encoding data, such as MIFlowCyt, ACS, NetCDF, Gating-ML, FuGEFlow, and OBI. When exchanging data between these systems, it is important to be able to describe the relationships between the standards, data elements, and value sets. The structure of the data, the naming of the data elements, and the actual values used to encode the same data may all need to be transformed in order to interoperate on the data. On one hand, relationships between the standards can be described manually; on the other hand, computable metadata enables automated transformation.
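
A small sketch of a metadata-driven transformation, assuming the repository records field renames and coded-value translations between two encodings. The MAPPING structure and field names below are hypothetical, not actual Gating-ML or MIFlowCyt structures:

    # Hypothetical computable mapping between two flow cytometry encodings:
    # renames fields and translates coded values so tools can interoperate.
    MAPPING = {
        "fields": {"marker_name": "parameter.label", "fluor": "parameter.dye"},
        "values": {"fluor": {"FITC": "fluorescein"}},
    }

    def transform(record, mapping):
        """Apply field renames and value translations from mapping metadata."""
        out = {}
        for src, dst in mapping["fields"].items():
            value = record[src]
            value = mapping["values"].get(src, {}).get(value, value)
            out[dst] = value
        return out

    print(transform({"marker_name": "CD4", "fluor": "FITC"}, MAPPING))
    # -> {'parameter.label': 'CD4', 'parameter.dye': 'fluorescein'}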

D. Developer Stories

Iterative development and management of information models

Iterative and incremental development is a cyclic software development process created in response to the weaknesses of the waterfall model. It starts with initial planning and ends with deployment, with cyclic interaction in between. The basic idea behind iterative enhancement is to develop a software system incrementally, allowing the developer to take advantage of what is learned during the development of earlier, incremental, deliverable versions of the system. Learning comes from both the development and the use of the system; where possible, key steps in the process are to start with a simple implementation of a subset of the software requirements and to iteratively enhance the evolving sequence of versions until the full system is implemented. At each iteration, design modifications are made and new functional capabilities are added.

To support an iterative development process, the metadata itself must be developed iteratively: the information model is enhanced, and semantics are added and removed, on a monthly basis. The metadata repository must therefore support the developer in creating, modifying, and removing metadata on an ongoing basis.
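
One way to support this, sketched below with hypothetical classes, is to version every change to a registered element rather than overwriting it, so that earlier iterations remain auditable:

    # Hypothetical versioned metadata store: each change to a registered
    # element creates a new revision rather than overwriting history.
    class MetadataStore:
        def __init__(self):
            self.revisions = {}           # element name -> list of revisions

        def create(self, name, definition):
            self.revisions[name] = [definition]

        def modify(self, name, definition):
            self.revisions[name].append(definition)

        def remove(self, name):
            self.revisions[name].append(None)   # tombstone keeps the audit trail

        def current(self, name):
            return self.revisions[name][-1]

    store = MetadataStore()
    store.create("TumorGrade", {"datatype": "code", "valueSet": "WHO grades I-IV"})
    store.modify("TumorGrade", {"datatype": "code", "valueSet": "WHO CNS grades 1-4"})
    print(store.current("TumorGrade"))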

Support standardized processes for software development and conformance

caEHR is the flagship project applying the ECCF process, which, when applied effectively, should produce specifications that can be used to evaluate how, and at what levels, various information systems are interoperable. This is important for enabling the coordination of IT resources across the community of NCI stakeholders. The caEHR project is currently creating and managing the various artifacts (CFSS, PIM, PSM) manually. Significant challenges include: 1) managing traceability and change; 2) formulating conformance assertions so that they can be evaluated; and 3) collaborating on model elements (i.e., distributed model authoring).

The Knowledge Repository project should facilitate the application of the ECCF process by providing a formal model of ECCF artifacts that can be queried to, for example, determine traceability among artifacts or to generate and synchronize other artifacts.
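
For example, if ECCF artifacts and their traceability links were captured as queryable metadata (the artifact identifiers below are hypothetical), traceability questions become simple graph queries:

    # Hypothetical formal model of ECCF artifacts: traceability links between
    # CFSS requirements, PIM elements, and PSM elements.
    TRACES = [
        ("CFSS:capture-demographics", "PIM:Patient"),
        ("PIM:Patient",               "PSM:PatientTable"),
        ("CFSS:report-accrual",       "PIM:StudySubject"),
    ]

    def trace_forward(artifact, traces=TRACES):
        """Follow traceability links transitively from one artifact."""
        reached, frontier = set(), {artifact}
        while frontier:
            nxt = {dst for src, dst in traces if src in frontier}
            frontier = nxt - reached
            reached |= nxt
        return sorted(reached)

    print(trace_forward("CFSS:capture-demographics"))
    # -> ['PIM:Patient', 'PSM:PatientTable']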