THE CANCER GENE INDEX GENE-DISEASE AND GENE-COMPOUND XML DOCUMENTS

Introduction

The Cancer Gene Index is distributed as two ZIP files, one containing the Gene-Disease data and one containing the Gene-Compound data. Each of these "databases" consists of an XML document and an accompanying DTD, named CancerIndex_disease_XML.dtd and CancerIndex_compound_XML.dtd, respectively. Although the XML documents are large, their structure is extremely simple.

XML System Requirements

The Cancer Gene Index Gene-Disease and Gene-Compound data sets require at least 720 MB of available hard drive space.

Warning!

Due to the size of the data resource, many applications may not be able to open the XML document.
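
Because of this size limit, one option is to stream the XML rather than open it in an editor. The following is a minimal Python sketch, assuming only the standard library; the function name count_gene_entries and both of its arguments are illustrative, and the actual archive and member file names must be taken from the downloaded ZIP files.

```python
import zipfile
import xml.etree.ElementTree as ET


def count_gene_entries(zip_path, member_name):
    """Count GeneEntry elements by streaming one XML member out of a ZIP archive.

    zip_path: path (or binary file object) of the downloaded ZIP file.
    member_name: name of the XML document inside the archive (assumed known).
    """
    count = 0
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as xml_stream:
            # iterparse reads the document incrementally instead of loading it whole.
            for _event, elem in ET.iterparse(xml_stream, events=("end",)):
                if elem.tag == "GeneEntry":
                    count += 1
                    # Discard the finished entry's contents to keep memory use low.
                    elem.clear()
    return count
```

This reads the XML straight from the archive, so the document never has to be opened in full by a desktop application.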

Cancer Gene Index DTDs

To use the XML, you must first understand the DTD elements (which correspond to the XML elements) and how to interpret the information within them (a brief Introduction to XML and DTDs is available in the preceding section). The Gene-Disease and Gene-Compound DTD and XML documents each contain thirty elements. Twenty-seven of these elements, and their sequence, are identical in the Gene-Disease and Gene-Compound documents. Only the three elements that specifically refer to or contain data on diseases or compounds differ between the documents.

Description of the Cancer Gene Index Gene-Disease DTD Elements

Gene-Disease DTD Element	Description
<!ELEMENT GeneEntryCollection (GeneEntry+)>	A collection of all gene, disease, evidence, and annotation information associated with a gene concept
<!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>	All information associated with a particular gene concept
<!ELEMENT HUGOGeneSymbol (#PCDATA)>	HUGO Gene Symbol for the gene concept
<!ELEMENT GeneAliasCollection (GeneAlias+)>	A collection of acronyms, synonyms, alternate spellings, and other aliases for the gene concept
<!ELEMENT GeneAlias (#PCDATA)>	A specific synonym, alternate spelling, or other alias for the gene concept
<!ELEMENT SequenceIdentificationCollection (HgncID, LocusLinkID, GenbankAccession, RefSeqID, UniProtID)>	A collection of standard identifiers for the gene concept
<!ELEMENT HgncID (#PCDATA)>	HGNC Identifier for the gene concept ^A
<!ELEMENT LocusLinkID (#PCDATA)>	LocusLink Identifier for the gene concept ^A
<!ELEMENT GenbankAccession (#PCDATA)>	Genbank Accession Number for the gene concept ^B
<!ELEMENT RefSeqID (#PCDATA)>	RefSeq Identifier for the gene concept ^C
<!ELEMENT UniProtID (#PCDATA)>	UniProt Identifier corresponding to the gene concept ^C
<!ELEMENT GeneStatusFlag (#PCDATA)>	The status of the gene set by a human curator
<!ELEMENT Sentence (GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode, Roles, SentenceStatusFlag)>	Data and annotations for the extracted sentence for gene-disease concept pairs
<!ELEMENT GeneData (MatchedGeneTerm, NCIGeneConceptCode)>	The Gene Term and EVS Gene Concept Code Identifier for the gene concept of the gene-disease concept pair
<!ELEMENT MatchedGeneTerm (#PCDATA)>	Matched term of the gene concept
<!ELEMENT NCIGeneConceptCode (#PCDATA)>	Gene Concept Code corresponding to the Matched Gene Term ^B
<!ELEMENT DiseaseData (MatchedDiseaseTerm, NCIDiseaseConceptCode)>	Disease Term and EVS Concept Code Identifier for the disease concept of the gene-disease concept pair
<!ELEMENT MatchedDiseaseTerm (#PCDATA)>	NCI Thesaurus Matched Disease Term for the disease concept
<!ELEMENT NCIDiseaseConceptCode (#PCDATA)>	EVS Disease Concept Code corresponding to the Matched Disease Term
<!ELEMENT Statement (#PCDATA)>	Sentence statement containing the evidence of the gene-disease association
<!ELEMENT PubMedID (#PCDATA)>	PubMed Identifier for the abstract from which the evidence was extracted
<!ELEMENT Organism (#PCDATA)>	Organism from which the data were collected
<!ELEMENT NegationIndicator (#PCDATA)>	Whether the findings of a gene-disease association within a sentence were negative
<!ELEMENT CellineIndicator (#PCDATA)>	Whether the data were collected from a cell line
<!ELEMENT Comments (#PCDATA)>	Comments made by expert curators
<!ELEMENT EvidenceCode (#PCDATA)>	Evidence Code
<!ELEMENT Roles (PrimaryNCIRoleCode, OtherRole)>	Role Code and Role Detail for the gene-disease concept pair
<!ELEMENT PrimaryNCIRoleCode (#PCDATA)>	Role Code for the gene-disease concept pair
<!ELEMENT OtherRole (#PCDATA)>	Role Detail
<!ELEMENT SentenceStatusFlag (#PCDATA)>	Sentence Status Flag set by the expert curators

^A Some of the genes in the gene-disease concept pairs are not included in the HGNC or LocusLink and, thus, the text contents for these elements will be "0"
^B The text contents for this element are " "
^C RefSeq and UniProt Identifiers were taken from HGNC, which does not include all gene concepts included in the Cancer Gene Index. For the affected genes, the element's text contents will be "-"

Description of the Cancer Gene Index Gene-Compound DTD Elements

Gene-Compound DTD Element	Description
<!ELEMENT GeneEntryCollection (GeneEntry+)>	A collection of all gene, compound, evidence, and annotation information associated with a gene concept
<!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>	All information associated with a particular gene concept
<!ELEMENT HUGOGeneSymbol (#PCDATA)>	HUGO Gene Symbol for the gene concept
<!ELEMENT GeneAliasCollection (GeneAlias+)>	A collection of acronyms, synonyms, alternate spellings, and other aliases for the gene concept
<!ELEMENT GeneAlias (#PCDATA)>	A specific synonym, alternate spelling, or other alias for the gene concept
<!ELEMENT SequenceIdentificationCollection (HgncID, LocusLinkID, GenbankAccession, RefSeqID, UniProtID)>	A collection of standard identifiers for the gene concept
<!ELEMENT HgncID (#PCDATA)>	HGNC Identifier for the gene concept ^A
<!ELEMENT LocusLinkID (#PCDATA)>	LocusLink Identifier for the gene concept ^A
<!ELEMENT GenbankAccession (#PCDATA)>	Genbank Accession Number for the gene concept ^B
<!ELEMENT RefSeqID (#PCDATA)>	RefSeq Identifier for the gene concept ^C
<!ELEMENT UniProtID (#PCDATA)>	UniProt Identifier corresponding to the gene concept ^C
<!ELEMENT GeneStatusFlag (#PCDATA)>	The status of the gene set by a human curator
<!ELEMENT Sentence (GeneData, DrugData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode, Roles, SentenceStatusFlag)>	Data and annotations for the extracted sentence for gene-compound concept pairs
<!ELEMENT GeneData (MatchedGeneTerm, NCIGeneConceptCode)>	The Gene Term and EVS Gene Concept Code Identifier for the gene concept of the gene-compound concept pair
<!ELEMENT MatchedGeneTerm (#PCDATA)>	Matched term of the gene concept
<!ELEMENT NCIGeneConceptCode (#PCDATA)>	Gene Concept Code corresponding to the Matched Gene Term ^B
<!ELEMENT DrugData (MatchedDrugTerm, NCIDrugConceptCode)>	Compound Term and EVS Concept Code Identifier for the compound concept of the gene-compound concept pair
<!ELEMENT MatchedDrugTerm (#PCDATA)>	NCI Thesaurus Matched Compound Term for the compound concept
<!ELEMENT NCIDrugConceptCode (#PCDATA)>	EVS Compound Concept Code corresponding to the Matched Compound Term
<!ELEMENT Statement (#PCDATA)>	Sentence statement containing the evidence of the gene-compound association
<!ELEMENT PubMedID (#PCDATA)>	PubMed Identifier for the abstract from which the evidence was extracted
<!ELEMENT Organism (#PCDATA)>	Organism from which the data were collected
<!ELEMENT NegationIndicator (#PCDATA)>	Whether the findings of a gene-compound association within a sentence were negative
<!ELEMENT CellineIndicator (#PCDATA)>	Whether the data were collected from a cell line
<!ELEMENT Comments (#PCDATA)>	Comments made by expert curators
<!ELEMENT EvidenceCode (#PCDATA)>	Evidence Code
<!ELEMENT Roles (PrimaryNCIRoleCode, OtherRole)>	Role Code and Role Detail for the gene-compound concept pair
<!ELEMENT PrimaryNCIRoleCode (#PCDATA)>	Role Code for the gene-compound concept pair
<!ELEMENT OtherRole (#PCDATA)>	Role Detail
<!ELEMENT SentenceStatusFlag (#PCDATA)>	Sentence Status Flag set by the expert curators

^A Some of the genes in the gene-compound concept pairs are not included in the HGNC or LocusLink and, thus, the text contents for these elements will be "0"
^B The text contents for this element are " "
^C RefSeq and UniProt Identifiers were taken from HGNC, which does not include all gene concepts included in the Cancer Gene Index. For the affected genes, the element's text contents will be "-"
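
When the data are loaded into another system, the placeholder values described in footnotes A, B, and C are usually better treated as missing. The following is a minimal Python sketch of that idea; the names PLACEHOLDERS and clean_value are illustrative, and stripping surrounding whitespace before comparing is an assumption, not something the DTD specifies.

```python
# Placeholder text per element, as described in footnotes A, B, and C.
# Footnote B's single space becomes an empty string once whitespace is stripped.
PLACEHOLDERS = {
    "HgncID": {"0"},
    "LocusLinkID": {"0"},
    "GenbankAccession": {""},
    "NCIGeneConceptCode": {""},
    "RefSeqID": {"-"},
    "UniProtID": {"-"},
}


def clean_value(element_name, text):
    """Return None if text is the documented placeholder for this element."""
    value = (text or "").strip()
    if value in PLACEHOLDERS.get(element_name, set()):
        return None
    return value
```

The lookup is keyed by element name so that a legitimate "0" or "-" in some other element is left untouched.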

While reviewing the Gene-Disease and Gene-Compound DTD elements, you likely noticed the inclusion of parenthetical information and special characters. Consider the following example Cancer Gene Index DTD elements:

1. <!ELEMENT GeneEntry (HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*)>
2. <!ELEMENT HUGOGeneSymbol (#PCDATA)>
3. <!ELEMENT GeneAliasCollection (GeneAlias+)>
4. <!ELEMENT GeneAlias (#PCDATA)>
5. <!ELEMENT Sentence (GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, SentenceStatusFlag)>

1. !ELEMENT GeneEntry declares that the GeneEntry element contains five child elements: HUGOGeneSymbol, GeneAliasCollection, SequenceIdentificationCollection, GeneStatusFlag, Sentence*
2. !ELEMENT HUGOGeneSymbol declares the HUGOGeneSymbol element to be of type #PCDATA
3. !ELEMENT GeneAliasCollection declares that the GeneAliasCollection element contains the child element GeneAlias+
4. !ELEMENT GeneAlias declares the GeneAlias element to be of type #PCDATA
5. !ELEMENT Sentence declares that the Sentence element contains eleven child elements: GeneData, DiseaseData, Statement, PubMedID, Organism, NegationIndicator, CellineIndicator, Comments?, EvidenceCode*, Roles*, SentenceStatusFlag

#PCDATA stands for parsed character data. An element declared as type #PCDATA contains text, and XML parsers will parse the text found between the start and end tags of the XML element that corresponds to this DTD element.

Cancer Gene Index element declarations describe not only child elements and text content, but also whether a child element must be present and how many times it can recur. Elements with one or more child elements declare the names of those child elements as a comma-separated list inside parentheses. Cancer Gene Index elements with multiple child elements are shown above in examples 1 and 5.

Note

Child elements must appear in the same order in the XML document and the DTD, and they can themselves have one or more children. Sentence, a child element of GeneEntry in example 1 above, has eleven children in example 5.
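
The ordering rule in this note can be checked mechanically. The following is a minimal Python sketch, assuming the declared child names are supplied as a plain list without occurrence suffixes; the function name children_in_declared_order is illustrative. It checks order only, not whether required children are present.

```python
def children_in_declared_order(declared, actual):
    """Return True if the tags in `actual` follow the order given in `declared`.

    declared: child element names in DTD order, without occurrence suffixes.
    actual: child tag names in the order they appear in the XML document.
    """
    position = {name: index for index, name in enumerate(declared)}
    if any(tag not in position for tag in actual):
        return False  # a child that the declaration does not mention
    indices = [position[tag] for tag in actual]
    # Repeated children sit next to each other, so a valid sequence
    # never steps back to an earlier declared position.
    return all(a <= b for a, b in zip(indices, indices[1:]))


# Child order of GeneEntry, taken from the DTD declaration above.
GENE_ENTRY_CHILDREN = [
    "HUGOGeneSymbol",
    "GeneAliasCollection",
    "SequenceIdentificationCollection",
    "GeneStatusFlag",
    "Sentence",
]
```

For example, a GeneEntry whose Sentence children all follow GeneStatusFlag passes the check, while one with a Sentence before HUGOGeneSymbol does not.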

Special characters (e.g., +, *, ?) appended to the name of a child element describe the expected number of occurrences of that element. The + character in example 3 above declares that the child element GeneAlias must occur one or more times inside the GeneAliasCollection element. The * character in examples 1 and 5 above declares that the child elements Sentence, EvidenceCode, and Roles can occur zero or more times inside the GeneEntry and Sentence elements, respectively. The ? character in example 5 above declares that the child element Comments can be absent or occur one time inside the Sentence element.
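
The occurrence rules described above can be read directly from a declaration. The following is a minimal Python sketch, assuming flat, comma-separated content models like those in the Cancer Gene Index DTDs (no nested groups or choices); the names OCCURRENCE and parse_element_declaration are illustrative.

```python
import re

# Occurrence indicator -> (minimum, maximum) occurrences; None means unbounded.
OCCURRENCE = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\((.*)\)\s*>")


def parse_element_declaration(declaration):
    """Split a simple <!ELEMENT ...> declaration into (name, children).

    children is a list of (child name, (min, max)) pairs in declared order;
    a #PCDATA content model is reported as ("#PCDATA", None).
    """
    match = ELEMENT_RE.fullmatch(declaration.strip())
    if match is None:
        raise ValueError(f"not a simple element declaration: {declaration!r}")
    name, model = match.groups()
    children = []
    for item in model.split(","):
        item = item.strip()
        if not item:
            continue
        if item == "#PCDATA":
            children.append(("#PCDATA", None))
            continue
        suffix = item[-1] if item[-1] in "?*+" else ""
        child = item[: len(item) - len(suffix)]
        children.append((child, OCCURRENCE[suffix]))
    return name, children
```

Applied to the GeneAliasCollection declaration, this yields one child, GeneAlias, with a minimum of one occurrence and no upper bound.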

Parsing the Cancer Gene Index XML

Many free XML parsers exist, as do parsing modules or libraries for a variety of common programming languages, that will quickly divide the Gene-Disease and Gene-Compound XML documents into their component data. Parsed data can be stored in a database, spreadsheet, text editor, or other data management application and be computed against. Note that, given the size of the resource, the data will likely need to be broken up into multiple files for all but the first of these options. Alternatively, you may prefer to write code that loops through the XML and extracts the information that you need. As End Users parse the Cancer Gene Index data into various formats (e.g., database dumps or tab-delimited text files), they are strongly encouraged to make these versions, and the code used to create them, available by posting them to the Cancer Gene Index User Community web page.
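
As one illustration of the approach described above, the following is a minimal Python sketch that streams the Gene-Disease XML and writes tab-delimited text. It assumes only the standard library and the element names from the Gene-Disease DTD; the function names and the choice of four output columns are illustrative, not part of the Cancer Gene Index distribution.

```python
import csv
import xml.etree.ElementTree as ET

COLUMNS = ["HUGOGeneSymbol", "MatchedDiseaseTerm", "NCIDiseaseConceptCode", "PubMedID"]


def iter_gene_disease_rows(source):
    """Yield one (gene symbol, disease term, disease code, PubMed ID) per Sentence.

    source: a file name or binary file object containing Gene-Disease XML.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "GeneEntry":
            continue
        symbol = elem.findtext("HUGOGeneSymbol", default="")
        for sentence in elem.findall("Sentence"):
            yield (
                symbol,
                sentence.findtext("DiseaseData/MatchedDiseaseTerm", default=""),
                sentence.findtext("DiseaseData/NCIDiseaseConceptCode", default=""),
                sentence.findtext("PubMedID", default=""),
            )
        # Drop the finished GeneEntry's contents so memory use stays low.
        elem.clear()


def write_tsv(rows, out):
    """Write a header line and the given rows as tab-delimited text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
```

To use it, pass the path of the extracted Gene-Disease XML file to iter_gene_disease_rows and a text file opened for writing to write_tsv. For the Gene-Compound document, the same pattern should apply with DrugData, MatchedDrugTerm, and NCIDrugConceptCode in place of the disease elements.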

Note

Depending on how you plan to use the Cancer Gene Index XML data, you may wish to leave the entities that stand for special characters in text (e.g., &lt;, &gt;, &amp;, &quot;, and &apos; for <, >, &, ", and ') in place, and replace them only when generating a final data subset, in order to avoid issues with escaping.
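
The following is a minimal Python sketch of converting between the escaped and plain forms at the final step, assuming the standard library module xml.sax.saxutils; the helper names to_plain_text and to_xml_text are illustrative.

```python
from xml.sax.saxutils import escape, unescape

# escape() and unescape() handle &amp;, &lt;, and &gt; on their own;
# the quote entities have to be supplied explicitly.
QUOTE_ENTITIES = {"&quot;": '"', "&apos;": "'"}
QUOTE_CHARS = {'"': "&quot;", "'": "&apos;"}


def to_plain_text(escaped):
    """Replace the five predefined XML entities with their characters."""
    return unescape(escaped, QUOTE_ENTITIES)


def to_xml_text(plain):
    """Replace special characters with their predefined XML entities."""
    return escape(plain, QUOTE_CHARS)
```

Note that most XML parsers already replace these entities when they return element text, so helpers like these matter mainly when the data are handled as raw text or written back out as XML.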

The NCI Thesaurus provides parent/child hierarchies for its concepts. Although these gene, disease, and compound (or, in NCI Thesaurus terms, "agent") concept ontologies were used to construct the Cancer Gene Index Lexical Dictionaries, they are not easily deduced from information within the Cancer Gene Index itself. Because a unique EVS Identifier is given for each gene, disease, and compound/agent, it is possible to trace back to the hierarchical data structures in three ways: with the NCI Thesaurus graphical user interface (by clicking the "Relationships" tab or the red "View in Hierarchy" button on any term's page), with the caBIO interfaces, or with the Enterprise Vocabulary Services (EVS) API.
