ABOUT THE CANCER GENE INDEX

Overview of the Cancer Gene Index

There are nearly 2.5 million cancer-related publications in MEDLINE as of December 2009, and this number is rapidly increasing. Scientists cannot manually identify all known cancer genes, and it is even more difficult to uncover the relationships between these genes and various human cancers or pharmacological compounds. In theory, one could exhaustively search PubMed and compile, for example, a list of the genes related to a given disease or compound, but this would take many weeks, and such a manual search would very likely still miss some genes. The National Cancer Institute (NCI) recognized that a publicly available resource that compiled these gene-disease and gene-compound data with relevant annotations would greatly facilitate research, and as part of its caBIG® initiative, it created the Cancer Gene Index Project.

The goal of the Cancer Gene Index is to further translational cancer research by providing a high-quality data resource consisting of genes that have been experimentally associated with human cancers and/or pharmacological substances, the evidence for these associations, and relevant annotations on the data. Scientists can use the data resource to quickly discover fact-based associations between genes and diseases or genes and compounds (i.e., all of the genes associated with a disease, all of the genes associated with a compound, or all of the diseases and compounds associated with a gene) and to evaluate the evidence from which these associations were determined. The resource was created through a process that coupled automated linguistic text analysis of millions of MEDLINE abstracts with manual validation and annotation of the extracted data. Details on this process are found in the section Creation of the Cancer Gene Index.

The Cancer Gene Index includes data on 6,955 unique human genes, nearly 12,000 cancer disease terms from a variety of public sources, and 2,180 unique pharmacologic compounds from the NCI Thesaurus. Associations between genes and diseases or genes and compounds were extracted from over 92 million analyzed sentences of nearly 20 million abstracts. The resource was last updated in June 2009.

Means of Accessing Cancer Gene Index Data

The Cancer Gene Index is available as computer-readable Gene-Disease and Gene-Compound data files. To use these files effectively, you must be a bioinformaticist or a scientist with programming experience, or collaborate with someone who has this expertise. Ideally, intuitive graphical user interfaces (GUIs) would allow all scientists to quickly and easily access these data and exploit the full power of the Cancer Gene Index data resource. Several preliminary caBIO interfaces already exist, and these can begin to give you an appreciation for the full potential of the data resource. In addition, geWorkbench pulls some Cancer Gene Index data from caBIO as annotations on genomic data, and the Cancer Molecular Analysis Portal provides limited views of the Cancer Gene Index data.

At this time, the caBIO interfaces are not yet fully featured, as is the case with the caBIO Portlet Templated Search, and in many cases are difficult for the average scientist to use, as with the caBIO Portal and the Simple Search of the caBIO Portlet on the caGrid Portal. An effort is currently underway to improve the caBIO GUIs for scientist end users. In the future, it is expected that these and other GUIs will be fully functional for all scientists.

With this step-by-step guide, a persistent and cautious scientist can use the Templated Search in conjunction with the caBIO Portal and NCI Thesaurus to find lists of genes associated with diseases or compounds or of diseases and compounds associated with a gene.

Selecting the Best Way for You to Access Cancer Gene Index Data

The following section will help you select the best means to access Cancer Gene Index data based on your experience with bioinformatics and computer programming.

If you have limited knowledge of the caBIO object model and caBIG®, you should use the Cancer Gene Index Gene-Disease and Gene-Compound XML documents and the XML documents guide. The format of these documents is extremely simple, making them very easy to work with. To download the XML documents, you must have a computer with at least 720 MB of free disk space, an internet connection, and a web browser; other system requirements depend upon the way in which you intend to use the data resource.
If you are familiar with the caBIO object model and caBIG®, you may wish to use one of the caBIO APIs. The caBIO APIs allow you to uncover associations within the Cancer Gene Index data set and to find additional information linking these data with associated pathways, protein annotations, clinical protocols, and other biomedical entities. Even with knowledge of the caBIO object model, however, it can be difficult to construct complex queries of caBIO. For information on system requirements, please refer to the links for each API on the caBIO wiki page.
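
As a rough illustration of working with the XML documents, the following Python sketch streams a Gene-Disease style document and builds a lookup from disease term to gene symbols. The element names (GeneEntryCollection, GeneEntry, GeneSymbol, DiseaseTerm) and the gene symbols are hypothetical placeholders chosen for this example, not the actual format; consult the XML documents guide for the real element names before adapting it.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified sample. Element names and gene symbols are
# placeholders, NOT the real Cancer Gene Index format or data.
SAMPLE_XML = """<GeneEntryCollection>
  <GeneEntry>
    <GeneSymbol>GENE_A</GeneSymbol>
    <DiseaseTerm>lymphoma</DiseaseTerm>
  </GeneEntry>
  <GeneEntry>
    <GeneSymbol>GENE_B</GeneSymbol>
    <DiseaseTerm>lymphoma</DiseaseTerm>
    <DiseaseTerm>colon carcinoma</DiseaseTerm>
  </GeneEntry>
</GeneEntryCollection>"""


def index_genes_by_disease(source):
    """Stream an XML document and map each disease term to a set of gene symbols."""
    genes_by_disease = defaultdict(set)
    # iterparse reads incrementally, which helps with files of several hundred MB.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "GeneEntry":
            symbol = elem.findtext("GeneSymbol")
            for term in elem.findall("DiseaseTerm"):
                genes_by_disease[term.text].add(symbol)
            elem.clear()  # release the children of the entry just processed
    return genes_by_disease


# A file path or an open binary file can be passed instead of the in-memory sample.
index = index_genes_by_disease(io.BytesIO(SAMPLE_XML.encode("utf-8")))
print(sorted(index["lymphoma"]))  # ['GENE_A', 'GENE_B']
```

Streaming is used here only because the download is large; for a small extract, loading the whole document with ET.parse would work equally well.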

You should use the step-by-step guide for the caBIO Portlet Templated Search tool. All that is required to access this web-based GUI is a computer with an internet connection and a web browser. Although it is easy to uncover gene-disease and gene-compound associations with this tool, it does not allow you to limit your search results and thus can return genes, diseases, or compounds that do not have validated associations. It also does not necessarily return all of the data you would like. Thus, you must use this tool in conjunction with:

- The caBIO Object Graph Browser and potentially
- The NCI Thesaurus

You can use the step-by-step guide to the caBIO Portal, which has the Freestyle Lexical Mine and Search for Biological Entities tools. In contrast to the caBIO Portlet Templated Search, these interfaces expose the entirety of the Cancer Gene Index. The caBIO Portal is similar to PubMed in that queries will retrieve many results that you must sift through, examining each to determine whether or not it is useful. Unlike PubMed, caBIO is much more likely to return the information that you want.
All that is required to access these web-based caBIO search tools is a computer with an internet connection and a web browser.

If you would like to view Cancer Gene Index data on the go, you can use the caBIO iPhone Application.

The caBIO Portlet also has a Simple Search tool. This tool is currently of limited utility, and you should instead use the XML documents, the caBIO Portlet Templated Search, or even the caBIO Portal. If you still wish to learn more about the Simple Search, a step-by-step guide is provided.

Examples of How the Cancer Gene Index Facilitates Translational Research

The Cancer Gene Index is a resource that can facilitate many different types of cancer research. In this first example from the Cancer Gene Index Project poster, the data resource is used to validate colon cancer translational medicine research data. Here, scientists have obtained access to deidentified demographic data, histopathology data (lymph node pN, tumor size pT, and degree of metastasis G), and tumor tissue biospecimens from patients, which are represented by gray figures. The scientists perform gene expression microarrays on each colon cancer biospecimen (pink and red colon tissue cells). The genes (purple DNA fragments) with significantly altered expression are validated by cross-referencing the Cancer Gene Index.

[Figure 1.2]
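
The cross-referencing step in the colon cancer example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cross_reference is invented for this example, the gene symbols are placeholders rather than real data, and the set of index genes is assumed to have already been extracted from the Cancer Gene Index for the disease of interest.

```python
def cross_reference(altered_genes, index_genes):
    """Split experimentally altered genes into those present in the index and those absent.

    Symbols are compared case-insensitively after trimming whitespace.
    """
    altered = {g.strip().upper() for g in altered_genes}
    known = {g.strip().upper() for g in index_genes}
    return sorted(altered & known), sorted(altered - known)


# Placeholder symbols, NOT real Cancer Gene Index content.
colon_cancer_index_genes = {"GENE_A", "GENE_B", "GENE_C"}
microarray_hits = ["gene_a", "GENE_C ", "GENE_X"]

supported, unsupported = cross_reference(microarray_hits, colon_cancer_index_genes)
print(supported)    # ['GENE_A', 'GENE_C']
print(unsupported)  # ['GENE_X']
```

Genes in the second list are not necessarily irrelevant; they simply have no recorded association in the data set used for the comparison.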

The Cancer Gene Index may also be used for lymphoma biomarker discovery. This example from the Cancer Gene Index Project poster illustrates that researchers can use the data resource to quickly identify the genes (purple DNA fragments) that are associated with, and may be biomarkers for, lymphoma. Here, gene-disease concept pair associations are shown as blue "to diseases" arrows. By searching the Cancer Gene Index for therapeutic compounds that are associated with these genes, scientists can easily uncover which of these candidate disease biomarkers are also associated with lymphoma-related compounds. An association between the gene encoding SPN, also known as sialophorin or CD43, and the compound leflunomide is represented by a black "has validated association with" arrow. Cancer Gene Index data can be cross-referenced to other resources, such as the clinical trial protocol database Physician Data Query® (PDQ), to obtain information about trials that link these data.

[Figure 1.3]
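
The lymphoma example above combines gene-disease and gene-compound associations. A minimal Python sketch of that join follows, assuming both kinds of association have already been loaded into dictionaries keyed by gene symbol. The function name candidate_biomarkers and the dictionary layout are assumptions made for this illustration; the SPN and leflunomide pair is the one described in the text, and all other entries are placeholders.

```python
def candidate_biomarkers(gene_disease, gene_compound, disease):
    """Return {gene: sorted compounds} for genes linked to `disease`
    that also have at least one associated compound."""
    result = {}
    for gene, diseases in gene_disease.items():
        compounds = gene_compound.get(gene)
        if disease in diseases and compounds:
            result[gene] = sorted(compounds)
    return result


# SPN/leflunomide comes from the example in the text; other entries are placeholders.
gene_disease = {
    "SPN": {"lymphoma"},
    "GENE_B": {"lymphoma"},
    "GENE_C": {"colon carcinoma"},
}
gene_compound = {
    "SPN": {"leflunomide"},
    "GENE_C": {"COMPOUND_X"},
}

print(candidate_biomarkers(gene_disease, gene_compound, "lymphoma"))
# {'SPN': ['leflunomide']}
```

GENE_B is dropped because it has no compound association, and GENE_C because it is not linked to lymphoma in this toy data.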

Documentation Sections

The Cancer Gene Index End User Documentation is organized into sections, each of which is a different page in the Cancer Gene Index wiki. Each wiki page provides you with the necessary links to related content. Should you wish to access additional sections of this wiki, you may use the following links:

About the Cancer Gene Index
Creation of the Cancer Gene Index
Data, Metadata, and Annotations
An Introduction to XML and DTDs
Cancer Gene Index Gene-Disease and Gene-Compound XML Documents
caBIO APIs
caBIO Portlet Templated Searches
An Introduction to Classes, Objects, and Object Models
caBIO Portal
caBIO iPhone Application
caBIO Simple Searches
Glossary
Credits and Resources