Semantic Infrastructure 2.0 Roadmap Draft Status

Semantic Infrastructure 2.0 needs to address metadata- and terminology-related requirements from the life sciences domain. This will enable interoperability both between different sub-domains within life sciences, and between life sciences and other domains in caBIG® such as clinical trials and electronic health records.

While life sciences will leverage common semantic functionalities such as Enterprise Conformance and Compliance Framework (ECCF) registry, modeling, forms, behavioral semantics, terminology and value sets, there are aspects specific to the life sciences domain that need to be addressed, including but not limited to:

  • A rapidly changing semantic environment in which novel, previously uncharacterized concepts must be described
  • Computational or analytical workflow compositions that process raw data to derive knowledge
  • Description of the statistical processes that underlie those computational workflows
  • Platform support for semantic description of potentially large-volume raw data (for example, next-generation sequencing or imaging data)
  • Support for provenance, both to trace data acquisition and data ownership and to make analytical results reproducible (a minimal sketch of such a record follows this list)
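As an illustration of the provenance requirement, the following is a minimal sketch in Python, assuming a simple record-keeping model; the ProvenanceRecord and ProvenanceStep classes and their fields are hypothetical illustrations, not caBIG® APIs.

```python
# A minimal sketch of a provenance record; all class and field names
# here are hypothetical illustrations, not caBIG(R) APIs.
from dataclasses import dataclass, field


@dataclass
class ProvenanceStep:
    tool: str         # e.g., an aligner or a normalization routine
    version: str      # exact tool version, needed for reproducibility
    parameters: dict  # parameters the step was run with


@dataclass
class ProvenanceRecord:
    dataset_id: str            # identifier of the raw dataset
    acquired_from: str         # originating institution (data ownership)
    acquisition_protocol: str  # how the raw data were produced
    steps: list[ProvenanceStep] = field(default_factory=list)

    def add_step(self, tool: str, version: str, parameters: dict) -> None:
        """Append one processing step, preserving the order of analysis."""
        self.steps.append(ProvenanceStep(tool, version, parameters))


# Example: record that a sequencing run was aligned with a specific tool.
record = ProvenanceRecord("run-0001", "LCCC", "next-generation sequencing")
record.add_step("aligner", "1.2.0", {"reference": "GRCh37"})
```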

As a starting point, the requirements gathering effort will be informed by past and current related efforts in caBIG®, such as:

  • The recent Semantic Infrastructure Requirements Elicitation effort
  • The caBIO ECCF service specification project on molecular and pathway annotation services from the Integrated Cancer Research (ICR) Workspace
  • Annotation and Image Markup (AIM) from the In-Vivo Imaging (IMAG) Workspace for both radiological and pathologic images
  • Work on "Dynamic Extensions" from the Tissue Banks and Pathology Tools (TBPT) Workspace

Nevertheless, further input is anticipated both from community feedback on this roadmap document and from the life sciences workgroup within the Semantic Infrastructure 2.0 Inception effort. This effort will involve both requirements gathering and prototype tool building.

This section highlights some key use cases that depend on data semantics. These use cases provide a representative set to capture the requirements of the life sciences domain. A comprehensive set of all life sciences use cases can be found on the ICRi WG GForge wiki archive.

Note: However, part of the Infrastructure Inception activities includes prototyping orchestrations and/or choreographies (including life science workflows), as well as outreach to communities to address other major use cases and requirements. The life sciences communities are now engaged with the Roadmap inception efforts on the following:

Refer to the pages listed for the use case and requirements gathering activities. Once mature, these will be moved into the relevant sections of the Roadmaps as use cases, requirements, and the resulting architecture design.

Discovering a biomarker

A scientist is trying to identify a new genetic biomarker for HER2/neu-negative Stage I breast cancer patients. Using a caGrid-aware client, the scientist queries for HER2/neu-negative tissue specimens of Stage I breast cancer patients at LCCC (University of North Carolina Lineberger Comprehensive Cancer Center-NC Cancer Hospital) that also have corresponding microarray experiments. Analysis of the microarray experiments identifies genes that are significantly over-expressed or under-expressed in a number of cases. The scientist decides that these results are significant, and related literature suggests the hypothesis that gene A may serve as a biomarker in HER2/neu-negative Stage I breast cancer. To validate this hypothesis in a significant number of cases, the scientist needs a larger data set, and so queries other cancer centers for all HER2/neu-negative specimens of Stage I breast cancer patients with corresponding microarray data, as well as for appropriate control data. After retrieving the microarray experiments, the scientist analyzes the data for over-expression of gene A (a minimal sketch of this analysis step follows).
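To make the analysis step concrete, here is a minimal sketch of testing one gene for differential expression between tumor and control samples; it shows only the statistics, not the caGrid query or microarray retrieval, and all expression values are invented placeholders.

```python
# A minimal sketch of the analysis step only; the expression values
# below are invented placeholders for illustration.
from scipy.stats import ttest_ind

# Hypothetical log2 expression values for gene A.
tumor_expression = [8.1, 7.9, 8.4, 8.2, 7.8, 8.5]
control_expression = [6.2, 6.5, 6.1, 6.4, 6.3, 6.0]

t_stat, p_value = ttest_ind(tumor_expression, control_expression)
mean_diff = (sum(tumor_expression) / len(tumor_expression)
             - sum(control_expression) / len(control_expression))

# A small p-value with a positive difference suggests over-expression; in
# practice, p-values would be corrected for testing many genes at once.
if p_value < 0.05:
    direction = "over-expressed" if mean_diff > 0 else "under-expressed"
    print(f"gene A appears {direction} (p = {p_value:.4f})")
```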

...

Version A is "Sequencing of selected genes via Maxam-Gilbert capillary ("first generation") sequencing." Nature, 2008 Sep 4, Epub ahead of print (posted on GForge for the ICRi workgroup).

  1. Develop a list of 2000 to 3000 genes thought to be likely targets for cancer-causing mutations.
  2. As a preliminary (lower cost) test, pick the most promising 600 genes from this list.
  3. Develop a gene model for each of these genes.
  4. Hand modify that gene model, for example, to merge small exons into a single amplicon.
  5. Design primers for PCR amplification for each of these genes.
  6. Order primers for each exon of each of the genes.
  7. Test the primers.
  8. In parallel with steps 1-7, identify matched pairs of tumor samples and normal tissue from the same individual for the tumors of interest.
  9. Have pathologists confirm that the tumor samples are what they claim to be and that they consist of a high percentage of tumor tissue.
  10. Make DNA from the tumor samples, confirming for each tumor that quantity and quality of the DNA are adequate.
  11. PCR amplify each of the genes.
  12. Sequence each of the exons of each of the genes for each tumor and normal pair of DNA samples.
  13. Find all the differences between the tumor sequence and the normal sequence (a toy sketch of this comparison follows the list).
  14. Confirm that these differences are real using custom arrays, the Sequenom (mass spectrometry) technology, or Biotage, or both. (Biotage is a pyrosequencing-based technology directed specifically at looking for SNP-like changes.)
  15. Identify changes that are seen at a higher frequency than what would occur by chance.
  16. Relate the genes in which these changes are seen to known signaling pathways.
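The following toy sketch illustrates the sequence comparison in step 13, under the simplifying assumption that the tumor and normal sequences are already aligned base-for-base; production pipelines instead compare mapped reads against a reference genome with dedicated variant callers.

```python
# A toy sketch of step 13: find mismatches between two aligned sequences.
def find_differences(tumor_seq: str, normal_seq: str) -> list[tuple[int, str, str]]:
    """Return (position, normal_base, tumor_base) for each mismatch."""
    if len(tumor_seq) != len(normal_seq):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (pos, n_base, t_base)
        for pos, (t_base, n_base) in enumerate(zip(tumor_seq, normal_seq))
        if t_base != n_base
    ]


# Example: one candidate somatic change, at position 3.
print(find_differences("ACGTAC", "ACGGAC"))  # [(3, 'G', 'T')]
```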

...

Version B. As above, except globally sequence all genes. Science 321: 1807-1812 (2008) (posted on GForge for the ICRi workgroup). Delete steps 1 and 2, and replace step 3 with: "Develop a gene model for each of the genes in the human genome."

...

Version C. Whole genome sequencing using second generation sequencers. Hypothetical (posted on GForge for the ICRi workgroup).

  1. Identify matched pairs of tumor samples and normal tissue from the same individual for the tumors of interest.
  2. Have pathologists confirm that the tumor samples are what they claim to be and that they consist of a high percentage of tumor tissue.
  3. Make DNA from the tumor samples, confirming for each tumor that the quantity and quality of the DNA are adequate.
  4. Sequence each of the sample pairs to the required fold coverage (7.5- to 35-fold, depending on the technology and read length; a back-of-the-envelope sketch follows this list).
  5. Map the individual reads to the canonical human genome sequence.
  6. Find all the differences between the tumor sequence and normal sequence.
  7. Confirm that these differences are real using custom arrays, the Sequenom (mass spectrometry) technology, or Biotage, or both. (Biotage is a pyrosequencing-based technology directed specifically at looking for SNP-like changes.)
  8. Identify changes that are seen at a higher frequency than what would occur by chance.
  9. Relate the genes in which these changes are seen to known signaling pathways.
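As a back-of-the-envelope illustration of what the fold coverage in step 4 implies for sequencing throughput, the sketch below applies the standard relation coverage = (number of reads × read length) / genome size; the genome size and read lengths are illustrative assumptions, not roadmap requirements.

```python
# A back-of-the-envelope sketch of the fold-coverage requirement in step 4,
# using: coverage = (reads x read length) / genome size.
GENOME_SIZE = 3.0e9  # approximate haploid human genome, in bases


def reads_required(coverage: float, read_length: int) -> float:
    """Reads N such that N * read_length / GENOME_SIZE equals the coverage."""
    return coverage * GENOME_SIZE / read_length


# 35-fold coverage with 100-base reads: about 1.05 billion reads per sample.
print(f"{reads_required(35, 100):.2e}")   # 1.05e+09
# 7.5-fold coverage with 400-base reads: about 56 million reads.
print(f"{reads_required(7.5, 400):.2e}")  # 5.62e+07
```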

...

The scientist submits a protocol to the institutional review board (IRB) and begins work upon approval. Libraries of surface-modified nanoparticles with appropriate pharmacokinetic and toxicity profiles are selected and screened for cell binding in vitro using cell cultures of “background” and “target” cell types or classes. The apparent concentration of binding or uptake of each nanoparticle to the different cell classes is measured. Metrics for differential binding to target versus background cells are calculated, and statistical significance is assessed by permutation (a minimal sketch of such a permutation test follows). These calculations employ analysis modules available through GenePattern (posted on GForge for the ICRi workgroup).
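The following is a minimal sketch of such a permutation test, assuming a simple difference of means as the differential-binding metric; the binding measurements are invented placeholders, and GenePattern's actual modules are not represented.

```python
# A minimal sketch of a permutation test for differential binding:
# how often does a random relabeling of target vs. background cells give
# a statistic at least as large as the observed one?
import random


def mean_difference(target, background):
    return sum(target) / len(target) - sum(background) / len(background)


def permutation_p_value(target, background, n_permutations=10_000, seed=0):
    """Fraction of random relabelings with a statistic >= the observed one."""
    rng = random.Random(seed)
    observed = abs(mean_difference(target, background))
    pooled = list(target) + list(background)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly relabel target vs. background
        if abs(mean_difference(pooled[:len(target)],
                               pooled[len(target):])) >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical apparent binding concentrations for one nanoparticle.
target_binding = [4.2, 3.9, 4.5, 4.1, 4.4]
background_binding = [2.1, 2.4, 2.0, 2.3, 2.2]
print(permutation_p_value(target_binding, background_binding))
```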

To validate the increased specificity for binding target cells, the nanoparticles that provide the best discrimination are further tested ex vivo. Under IRB approval, anatomically intact human tissue specimens containing target and background cells are collected. The tissues are incubated with nanoparticles and evaluated for nanoparticle localization using microscopy. Further validation is conducted in vivo using an animal model: animals are injected with the nanoparticle and with another tissue-specific probe, and intravital microscopy is used to determine the extent of co-localization. The scientist then contacts the technology transfer office to pursue next steps.

...

This is a scenario based on evaluating and enriching the NanoParticle Ontology (NPO) (posted on GForge for the ICRi workgroup), an ontology being developed at Washington University in St. Louis to serve as a reference source of controlled vocabularies and terminologies in cancer nanotechnology research. Concepts in the NPO have their instances in the data represented in a database or in the literature. In a database, these instances include field names, field entries, or both for the data model. The NPO represents the knowledge supporting unambiguous annotation and semantic interpretation of data in a database or in the literature. To expedite the development of the NPO, object models must be developed to capture the concepts and inter-concept relationships from the literature. Minimum information standards should provide guidelines for developing these object models, so that the minimum information is also captured for representation in the NPO.

Nanotechnology is being applied to clinical therapeutics, but this use case could be extended to the development of any specialized therapeutics. There are various pre-existing databases holding experimental data that need to be accessible across the entire community to facilitate rational nanomaterial design. Two strategies are being employed. The first is to establish semantic interoperability by finding areas of semantic overlap among the current database models, based on controlled vocabularies (NCI Thesaurus, NCI Metathesaurus, NanoParticle Ontology); a toy sketch of this mapping approach follows. The second is to develop a data submission standard based on the extension of standardized models (Biomedical Research Integrated Domain Group (BRIDG), Life Sciences Domain Analysis Model (LS-DAM)), where the extensions are supported by controlled vocabularies. New vocabulary is needed to support both of these strategies: new concepts are curated in the controlled vocabularies as appropriate, and term definitions are reviewed by the community.
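As an illustration of the first strategy, the toy sketch below maps field names from two hypothetical database models to shared controlled-vocabulary concepts and reports where they overlap; all field names and concept identifiers are invented, not actual NPO or NCI Thesaurus codes.

```python
# A toy sketch of finding semantic overlap between two database models by
# annotating each field with a shared controlled-vocabulary concept.
# Field names and concept identifiers are invented placeholders.
database_a = {
    "particle_diameter": "NPO_SIZE",
    "zeta_potential": "NPO_SURFACE_CHARGE",
    "assay_name": "NPO_ASSAY",
}
database_b = {
    "size_nm": "NPO_SIZE",
    "surface_charge_mv": "NPO_SURFACE_CHARGE",
    "vendor": "NPO_SUPPLIER",
}

# Fields annotated with the same concept are candidates for semantically
# interoperable queries across both databases.
shared = set(database_a.values()) & set(database_b.values())
overlap = {
    concept: (
        [f for f, c in database_a.items() if c == concept],
        [f for f, c in database_b.items() if c == concept],
    )
    for concept in shared
}
print(overlap)
```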
