NIH | National Cancer Institute | NCI Wiki  

To Print the Guide

We recommend that you print one wiki page of the guide at a time. To do this, click the printer icon at the top right of the page; then from the browser File menu, choose Print. Printing multiple pages at one time is more complex. For instructions, refer to How do I print multiple pages?

Introduction

The Cancer Gene Index is available as two ZIP files that contain the data from the Gene-Disease and Gene-Compound Databases. The Cancer Gene Index Gene-Disease and Gene-Compound "Databases" each include an XML document and an accompanying DTD, CancerIndex_disease_XML.dtd and CancerIndex_compound_XML.dtd, respectively. You may freely download the Gene-Disease file and Gene-Compound file from the Cancer Gene Index website.

XML System Requirements

The Cancer Gene Index Gene-Disease and Gene-Compound data sets require at least 720 MB of available hard drive space.

Cancer Gene Index DTDs

In order to use the XML, you must first understand the DTD elements (which correspond to the XML elements) and how to interpret information within the elements. The Gene-Disease and Gene-Compound DTD and XML documents each contain thirty elements. Twenty-seven of these elements and the sequence of these elements are identical for the Gene-Disease and Gene-Compound documents. Only the three elements specifically referring to or containing data on diseases or compounds differ between the documents.


Description of the Cancer Gene Index Gene-Disease DTD Elements

Gene-Disease DTD Element

Description

<!ELEMENT GeneEntryCollection (GeneEntry+)>

A collection of all gene, disease, evidence, and annotation information associated with a gene concept

<!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>

All information associated with a particular gene concept

<!ELEMENT HUGOGeneSymbol (#PCDATA)>

HUGO Gene Symbol for the gene concept

<!ELEMENT GeneAliasCollection (GeneAlias+)>

A collection of acronyms, synonyms, alternate spellings, and other aliases for the gene concept

<!ELEMENT GeneAlias (#PCDATA)>

A specific synonym, alternate spelling, or other alias for the gene concept

<!ELEMENT SequenceIdentificationCollection (HgncID, LocusLinkID, GenbankAccession, RefSeqID, UniProtID)>

A collection of standard identifiers for the gene concept

<!ELEMENT HgncID (#PCDATA)>

HGNC Identifier for the gene concept (note A)

<!ELEMENT LocusLinkID (#PCDATA)>

LocusLink Identifier for the gene concept (note A)

<!ELEMENT GenbankAccession (#PCDATA)>

Genbank Accession Number for the gene concept (note B)

<!ELEMENT RefSeqID (#PCDATA)>

RefSeq Identifier for the gene concept (note C)

<!ELEMENT UniProtID (#PCDATA)>

UniProt Identifier corresponding to the gene concept (note C)

<!ELEMENT GeneStatusFlag (#PCDATA)>

The status of the gene set by a human curator

<!ELEMENT Sentence (GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, SentenceStatusFlag)>

Data and annotations for the extracted sentence for gene-disease concept pairs

<!ELEMENT GeneData (MatchedGeneTerm, NCIGeneConceptCode)>

The Gene Term and EVS Gene Concept Code Identifier for the gene concept of the gene-disease concept pair

<!ELEMENT MatchedGeneTerm (#PCDATA)>

Matched term of the gene concept

<!ELEMENT NCIGeneConceptCode (#PCDATA)>

Gene Concept Code corresponding to the Matched Gene Term (note B)

<!ELEMENT DiseaseData (MatchedDiseaseTerm, NCIDiseaseConceptCode)>

Disease Term and EVS Concept Code Identifier for the disease concept of the gene-disease concept pair

<!ELEMENT MatchedDiseaseTerm (#PCDATA)>

NCI Thesaurus Matched Disease Term for the disease concept

<!ELEMENT NCIDiseaseConceptCode (#PCDATA)>

EVS Disease Concept Code corresponding to the Matched Disease Term

<!ELEMENT Statement (#PCDATA)>

Sentence statement containing the evidence of the gene-disease association

<!ELEMENT PubMedID (#PCDATA)>

PubMed Identifier for the abstract from which the evidence was extracted

<!ELEMENT Organism (#PCDATA)>

Organism from which the data were collected

<!ELEMENT NegationIndicator (#PCDATA)>

Whether the findings of a gene-disease association within a sentence were negative

<!ELEMENT CellineIndicator (#PCDATA)>

Whether the data were collected from a cell line

<!ELEMENT Comments (#PCDATA)>

Comments made by expert curators

<!ELEMENT EvidenceCode (#PCDATA)>

Evidence Code

<!ELEMENT Roles (PrimaryNCIRoleCode*, OtherRole*)>

Role Code and Role Detail for the gene-disease concept pair

<!ELEMENT PrimaryNCIRoleCode (#PCDATA)>

Role Code for the gene-disease concept pair

<!ELEMENT OtherRole (#PCDATA)>

Role Detail

<!ELEMENT SentenceStatusFlag (#PCDATA)>

Sentence Status Flag set by the expert curators

Note A: Some of the genes in the gene-disease concept pairs are not included in HGNC or LocusLink; for these genes, the text contents of these elements will be "0".
Note B: The text contents for this element are " ".
Note C: RefSeq and UniProt Identifiers were taken from HGNC, which does not include all gene concepts included in the Cancer Gene Index. For the affected genes, the element's text contents will be "-".

Description of the Cancer Gene Index Gene-Compound DTD Elements

Gene-Compound DTD Element

Description

<!ELEMENT GeneEntryCollection (GeneEntry+)>

A collection of all gene, compound, evidence, and annotation information associated with a gene concept

<!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>

All information associated with a particular gene concept

<!ELEMENT HUGOGeneSymbol (#PCDATA)>

HUGO Gene Symbol for the gene concept

<!ELEMENT GeneAliasCollection (GeneAlias+)>

A collection of acronyms, synonyms, alternate spellings, and other aliases for the gene concept

<!ELEMENT GeneAlias (#PCDATA)>

A specific synonym, alternate spelling, or other alias for the gene concept

<!ELEMENT SequenceIdentificationCollection (HgncID, LocusLinkID, GenbankAccession, RefSeqID, UniProtID)>

A collection of standard identifiers for the gene concept

<!ELEMENT HgncID (#PCDATA)>

HGNC Identifier for the gene concept (note A)

<!ELEMENT LocusLinkID (#PCDATA)>

LocusLink Identifier for the gene concept (note A)

<!ELEMENT GenbankAccession (#PCDATA)>

Genbank Accession Number for the gene concept (note B)

<!ELEMENT RefSeqID (#PCDATA)>

RefSeq Identifier for the gene concept (note C)

<!ELEMENT UniProtID (#PCDATA)>

UniProt Identifier corresponding to the gene concept (note C)

<!ELEMENT GeneStatusFlag (#PCDATA)>

The status of the gene set by a human curator

<!ELEMENT Sentence (GeneData, DrugData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, SentenceStatusFlag)>

Data and annotations for the extracted sentence for gene-compound concept pairs

<!ELEMENT GeneData (MatchedGeneTerm, NCIGeneConceptCode)>

The Gene Term and EVS Gene Concept Code Identifier for the gene concept of the gene-compound concept pair

<!ELEMENT MatchedGeneTerm (#PCDATA)>

Matched term of the gene concept

<!ELEMENT NCIGeneConceptCode (#PCDATA)>

Gene Concept Code corresponding to the Matched Gene Term (note B)

<!ELEMENT DrugData (MatchedDrugTerm, NCIDrugConceptCode)>

Compound Term and EVS Concept Code Identifier for the compound concept of the gene-compound concept pair

<!ELEMENT MatchedDrugTerm (#PCDATA)>

NCI Thesaurus Matched Compound Term for the compound concept

<!ELEMENT NCIDrugConceptCode (#PCDATA)>

EVS Compound Concept Code corresponding to the Matched Compound Term

<!ELEMENT Statement (#PCDATA)>

Sentence statement containing the evidence of the gene-compound association

<!ELEMENT PubMedID (#PCDATA)>

PubMed Identifier for the abstract from which the evidence was extracted

<!ELEMENT Organism (#PCDATA)>

Organism from which the data were collected

<!ELEMENT NegationIndicator (#PCDATA)>

Whether the findings of a gene-compound association within a sentence were negative

<!ELEMENT CellineIndicator (#PCDATA)>

Whether the data were collected from a cell line

<!ELEMENT Comments (#PCDATA)>

Comments made by expert curators

<!ELEMENT EvidenceCode (#PCDATA)>

Evidence Code

<!ELEMENT Roles (PrimaryNCIRoleCode*, OtherRole*)>

Role Code and Role Detail for the gene-compound concept pair

<!ELEMENT PrimaryNCIRoleCode (#PCDATA)>

Role Code for the gene-compound concept pair

<!ELEMENT OtherRole (#PCDATA)>

Role Detail

<!ELEMENT SentenceStatusFlag (#PCDATA)>

Sentence Status Flag set by the expert curators

Note A: Some of the genes in the gene-compound concept pairs are not included in HGNC or LocusLink; for these genes, the text contents of these elements will be "0".
Note B: The text contents for this element are " ".
Note C: RefSeq and UniProt Identifiers were taken from HGNC, which does not include all gene concepts included in the Cancer Gene Index. For the affected genes, the element's text contents will be "-".
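
The placeholder conventions described in the notes above can be normalized while reading the identifiers. The following is a minimal Python sketch and is not part of the Cancer Gene Index distribution. The function name read_identifiers and all sample values are hypothetical, and the sketch assumes that the placeholders appear exactly as the notes describe them ("0", whitespace only, and "-").

```python
import xml.etree.ElementTree as ET

# Placeholder text per identifier element, following the notes above.
# Empty or whitespace-only text is always treated as missing.
PLACEHOLDERS = {
    "HgncID": {"0"},
    "LocusLinkID": {"0"},
    "GenbankAccession": set(),
    "RefSeqID": {"-"},
    "UniProtID": {"-"},
}


def read_identifiers(seq_id_collection):
    """Map each identifier element name to its text, or None for a placeholder."""
    result = {}
    for name, placeholders in PLACEHOLDERS.items():
        text = (seq_id_collection.findtext(name) or "").strip()
        result[name] = None if text == "" or text in placeholders else text
    return result


# Hypothetical sample values, for illustration only.
sample = ET.fromstring(
    "<SequenceIdentificationCollection>"
    "<HgncID>0</HgncID>"
    "<LocusLinkID>1234</LocusLinkID>"
    "<GenbankAccession> </GenbankAccession>"
    "<RefSeqID>-</RefSeqID>"
    "<UniProtID>P00000</UniProtID>"
    "</SequenceIdentificationCollection>"
)
identifiers = read_identifiers(sample)
print(identifiers)
```

Mapping placeholders to None keeps downstream code from mistaking "0" or "-" for a real identifier.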

Additional DTD Information

The Gene-Disease and Gene-Compound DTD elements include meaningful parenthetical information and special characters. Consider the following example Cancer Gene Index DTD elements:

  1. <!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>
  2. <!ELEMENT HUGOGeneSymbol (#PCDATA)>
  3. <!ELEMENT GeneAliasCollection (GeneAlias+)>
  4. <!ELEMENT GeneAlias (#PCDATA)>
  5. <!ELEMENT Sentence (GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, SentenceStatusFlag)>

!ELEMENT GeneEntry declares that the GeneEntry element contains five child elements: HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, and Sentence*
!ELEMENT HUGOGeneSymbol declares that the HUGOGeneSymbol element contains text of type #PCDATA
!ELEMENT GeneAliasCollection declares that the GeneAliasCollection element contains one or more GeneAlias child elements (GeneAlias+)
!ELEMENT GeneAlias declares that the GeneAlias element contains text of type #PCDATA
!ELEMENT Sentence declares that the Sentence element contains eleven child elements: GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, and SentenceStatusFlag

#PCDATA stands for parsed character data. An element declared as #PCDATA contains text rather than child elements, and XML parsers will parse the text found between the start and end tags of the corresponding XML element.

Cancer Gene Index element declarations specify not only child elements and text contents, but also whether a child element must be present and the number of times it can recur. Elements with one or more child elements declare the names of the child elements as comma-separated lists inside parentheses. Examples 1 and 5 above show Cancer Gene Index elements with multiple child elements.

Note

Child elements appear in the same order in the XML documents as the DTDs, and they can themselves have one or more children, as described below.

Special characters (+, *, and ?) appended to the name of a child element describe the expected number of occurrences of that element. The + character in example 3 above declares that the child element GeneAlias must occur one or more times inside the GeneAliasCollection element. The * character in examples 1 and 5 above declares that the child element Sentence can occur zero or more times inside the GeneEntry element, and that the child elements EvidenceCode and Roles can each occur zero or more times inside the Sentence element. The ? character in example 5 above declares that the child element Comments can be absent or occur one time inside the Sentence element. A child element with no special character must occur exactly once.
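
The occurrence rules above can be expressed as minimum and maximum counts. The following Python sketch is one possible way to read them out of a declaration. The function name parse_content_model is hypothetical, and the sketch handles only the flat, comma-separated content models used in the Cancer Gene Index DTDs, not nested groups or choices.

```python
import re

# (minimum, maximum) occurrences for each DTD occurrence indicator.
# None means "no upper bound"; no indicator means exactly once.
BOUNDS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_content_model(declaration):
    """Return {child_name: (min, max)} for a flat, comma-separated content model."""
    match = re.match(r"<!ELEMENT\s+(\w+)\s+\((.*)\)>$", declaration.strip())
    if match is None:
        raise ValueError("not an element declaration: " + declaration)
    children = {}
    for item in match.group(2).split(","):
        item = item.strip()
        if item == "#PCDATA":
            continue  # text content, not a child element
        indicator = item[-1] if item[-1] in "?*+" else ""
        children[item.rstrip("?*+")] = BOUNDS[indicator]
    return children


sentence = parse_content_model(
    "<!ELEMENT Sentence (GeneData, DiseaseData, Statement, PubMedID, Organism, "
    "NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, "
    "SentenceStatusFlag)>"
)
print(sentence["Comments"])      # (0, 1): absent or present once
print(sentence["EvidenceCode"])  # (0, None): zero or more times
print(sentence["Statement"])     # (1, 1): exactly once
```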

Parsing the Cancer Gene Index XML

Many free XML parsers exist, as do parsing modules and libraries for a variety of common programming languages, that will quickly divide the Gene-Disease and Gene-Compound XML documents into their component data. You can store the parsed data in a database or other data management application and run computations against it. Alternatively, you may prefer to write code that recursively loops through the XML and extracts the information that you desire. End users who parse the Cancer Gene Index data into other formats (e.g., database dumps or tab-delimited text files) or who create code to walk through the XML are strongly encouraged to make these versions and the code available by posting them to the Cancer Gene Index User Community Parsed Data and Code web page.
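
Because the data sets are large, one option is to stream the XML one GeneEntry at a time instead of loading the whole document into memory. The following is a minimal sketch using the iterparse function from the Python standard library. The function name iter_gene_entries is hypothetical, and the in-memory sample is abbreviated for illustration (it omits child elements that the DTD requires); with the real data you would pass the path of the downloaded XML file instead.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_entries(source):
    """Yield (gene symbol, number of sentences) for each GeneEntry in turn."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "GeneEntry":
            symbol = elem.findtext("HUGOGeneSymbol")
            yield symbol, len(elem.findall("Sentence"))
            elem.clear()  # drop the children of the entry just processed


# Abbreviated, hypothetical sample document for illustration only.
sample = io.BytesIO(
    b"<GeneEntryCollection>"
    b"<GeneEntry><HUGOGeneSymbol>GENE1</HUGOGeneSymbol>"
    b"<Sentence/><Sentence/></GeneEntry>"
    b"<GeneEntry><HUGOGeneSymbol>GENE2</HUGOGeneSymbol></GeneEntry>"
    b"</GeneEntryCollection>"
)
counts = list(iter_gene_entries(sample))
print(counts)
```

Clearing each GeneEntry after it has been processed keeps memory use roughly proportional to a single entry, although the emptied entry elements remain attached to the root element.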

Using the Cancer Gene Index Data

Refining Your Searches with Flags and Indicators

You can use the Cancer Gene Index to discover associations between genes and diseases or genes and compounds. These associations were derived from the literature using a sophisticated automated process, and thus not all of the extracted gene-disease or gene-compound concept pair associations were found to be factual during validation by expert human curators.

Tip

If you would like to restrict your queries of the Cancer Gene Index data sets to only those concept pairs that were validated as being truly associated, filter out information where the SentenceStatusFlag is "no_fact" or "unclear" and where the NegationIndicator is "yes."

You can also take advantage of other annotations. For example, you could filter by CellineIndicator or Organism to exclude data derived from cell lines or non-human species. For information about the status flags, indicators, and other annotations within the XML documents, refer to the Cancer Gene Index Data, Metadata, and Annotations wiki page.
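
The filter described in the tip above could be sketched in Python as follows. The flag values "no_fact", "unclear", and "yes" come from the tip; the function name is_validated, the value "no", and the status value "example_status" are hypothetical and used only for illustration. The sketch assumes that the values appear in lowercase exactly as written here.

```python
import xml.etree.ElementTree as ET

# Status values that mark a sentence as not validated (from the tip above).
EXCLUDED_STATUS = {"no_fact", "unclear"}


def is_validated(sentence):
    """True if the Sentence element passes the filter described in the tip."""
    status = (sentence.findtext("SentenceStatusFlag") or "").strip()
    negated = (sentence.findtext("NegationIndicator") or "").strip()
    return status not in EXCLUDED_STATUS and negated != "yes"


# Abbreviated, hypothetical sample entry for illustration only.
entry = ET.fromstring(
    "<GeneEntry>"
    "<Sentence><NegationIndicator>no</NegationIndicator>"
    "<SentenceStatusFlag>example_status</SentenceStatusFlag></Sentence>"
    "<Sentence><NegationIndicator>yes</NegationIndicator>"
    "<SentenceStatusFlag>example_status</SentenceStatusFlag></Sentence>"
    "<Sentence><NegationIndicator>no</NegationIndicator>"
    "<SentenceStatusFlag>no_fact</SentenceStatusFlag></Sentence>"
    "</GeneEntry>"
)
kept = [s for s in entry.findall("Sentence") if is_validated(s)]
print(len(kept))
```

Only the first sample sentence is kept: the second is negated and the third is flagged "no_fact". Similar checks on CellineIndicator or Organism could be added once you have confirmed the values those elements take in the data.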

Codes and Details

The expert curators also set Evidence Codes, Role Codes, and Role Details. Evidence Codes (EvidenceCode) qualify the assertions of a gene-disease or gene-compound association made in the sentence and provide information on how these assertions were made. Role Codes (PrimaryNCIRoleCode) and Role Details (OtherRole) describe the semantic associations between a gene and either a disease or a compound term. Whereas the Evidence Codes describe how the association was inferred or the type of experiment upon which the inference was made, Role Codes and Role Details give information about the actual gene-disease or gene-compound association.
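
As a sketch of how these annotations might be collected from a single Sentence element, the following Python example gathers the evidence codes, role codes, and role details into lists. The function name sentence_annotations and all code values in the sample are hypothetical placeholders, not actual Cancer Gene Index codes.

```python
import xml.etree.ElementTree as ET


def sentence_annotations(sentence):
    """Collect evidence codes, role codes, and role details from one Sentence."""
    return {
        "evidence_codes": [e.text for e in sentence.findall("EvidenceCode")],
        "role_codes": [r.text for r in sentence.findall("Roles/PrimaryNCIRoleCode")],
        "role_details": [r.text for r in sentence.findall("Roles/OtherRole")],
    }


# Abbreviated sample with placeholder values, for illustration only.
sentence = ET.fromstring(
    "<Sentence>"
    "<EvidenceCode>EXAMPLE_EVIDENCE_1</EvidenceCode>"
    "<EvidenceCode>EXAMPLE_EVIDENCE_2</EvidenceCode>"
    "<Roles><PrimaryNCIRoleCode>EXAMPLE_ROLE</PrimaryNCIRoleCode>"
    "<OtherRole>example role detail</OtherRole></Roles>"
    "</Sentence>"
)
annotations = sentence_annotations(sentence)
print(annotations)
```

Because EvidenceCode and Roles can each occur zero or more times, every value is returned as a list, which may be empty.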

For information about the meaning of the codes, details, and other data and annotations within the XML documents, refer to the Cancer Gene Index Data, Metadata, and Annotations wiki page.

Gene, Disease, and Compound Ontologies

The NCI Thesaurus provides ontological information for its concepts. Although these gene, disease, and compound (or, in NCI Thesaurus, "agent") concept ontologies were used to construct the Cancer Gene Index Lexical Dictionaries, they are not easily deduced from information within the Cancer Gene Index itself.

Note

Because the Cancer Gene Index does not include information on disease, compound, and gene ontologies, searches for a particular disease term, for example, will only return genes that match that exact term.

It is possible, however, to trace back to the hierarchical disease, compound, and gene data. To do so, look up the NCI Thesaurus concept terms (e.g., MatchedGeneTerm or MatchedDrugTerm) or NCI Thesaurus Concept Codes (e.g., NCIDiseaseConceptCode or NCIDrugConceptCode) with the NCI Thesaurus graphical user interface or the Enterprise Vocabulary Services (EVS) API.

Tip

The NCI Thesaurus Concept Code for a gene, disease, or compound term is also its EVS Identifier.

Using the NCI Thesaurus to Find Parent/Child Concepts

To view disease, compound, and gene ontologies, open a new browser tab or window and navigate to the NCI Thesaurus web page, enter your gene symbol, matched term, or NCI Thesaurus concept code (for example, "ovarian serous adenocarcinoma"), and click the Search button. If you need help finding your gene, disease, or compound term, click the Contact Us link at the bottom of the page.

You may view parent and child terms for any disease term by clicking on the Relationships tab. For example, "ovarian serous adenocarcinoma" has the children "ovarian serous cystadenocarcinoma" and "ovarian serous papillary adenocarcinoma" and the parent terms "malignant ovarian serous tumor," "ovarian adenocarcinoma," and "serous adenocarcinoma." Alternatively, if you would like to view where your term fits in the entire disease hierarchy, click the red View in Hierarchy button.
