caBIO Data Sources

caBIO not only provides a theoretical object model, but also provides a real-world instantiation of the objects based on a variety of genomic and clinical information data sources. NCI CBIIT aggregates these data into a database hosted at the NCI, and supports public access through the caBIO APIs. During the aggregation process, data relationships and links are captured and incorporated into the database. Thus caBIO provides a rich, integrated interface to many important biomedical information sources.

The caBIO database is refreshed with the latest versions of each data source on a semi-monthly basis. Information about each data refresh cycle is available in the caBIO Data Refresh Release notes.

Affymetrix

Data Source

Description

Affymetrix

Affymetrix provides the majority of the microarray data in caBIO. These data include the allele frequencies of SNPs in different populations, which are represented by the PopulationFrequency object.

The probeset information is available through ArrayReporter, ExpressionArrayReporter and SNPArrayReporter objects, among others.

Details regarding the arrays exposed in caBIO are available through the Microarray object.

The GeneRelativeLocation object provides the location (intron, upstream, downstream, etc.) of a SNP referenced in SNP Array datasets with respect to its associated genes. The validation status for a SNP comes from NCBI. The SNP Consortium (TSC) Ltd., a non-profit foundation, provides the TSC IDs for SNPs in DatabaseCrossReference.

Agilent

Data Source

Description

Agilent

Data from Agilent's Whole Human Genome 44K Arrays and aCGH 244K Arrays are exposed through ArrayReporter, MicroArray and ExpressionArrayReporter. Associated InterPro protein domains and UniGene genes are available through the ProteinDomain and Gene objects.

BioCarta

Data Source

Description

BioCarta

BioCarta and its Proteomic Pathway Project (P3) provide detailed graphical renderings of pathway information concerning adhesion, apoptosis, cell activation, cell signaling, cell cycle regulation, cytokines/chemokines, developmental biology, hematopoiesis, immunology, metabolism, and neuroscience. NCI's CMAP web site captures pathway information from BioCarta and transforms the downloaded image data into Scalable Vector Graphics (SVG) representations that support interactive manipulation of the online images. The CMAP web site displays BioCarta pathways selected by the user and provides options for highlighting anomalies, which include under- or overexpressed genes as well as mutations.

The pathway information is available via the Pathway object in caBIO. caBIO provides a class for manipulating SVG diagrams.

Canada DrugBank

Data Source

Description

DrugBank

The Canada DrugBank is a repository containing detailed information on drugs (chemical, pharmaceutical, etc.) and drug targets (sequence, pathways, etc.). The repository contains over 4800 drug entries and 2500 drug targets and is available for download as open source.

Through its API, caBIO exposes this additional drug and drug target information to augment the data on drugs and drug targets obtained from the Cancer Gene Index project, cross-referencing it against the PharmGKB drug database.

CGAP

Data Source

Description

CGAP

CGAP (Cancer Genome Anatomy Project) provides a collection of gene expression profiles of normal, pre-cancer, and cancer cells taken from various tissues. The CGAP interface allows users to browse these profiles by various search criteria, including histology type, tissue type, library protocol, and sample preparation methods. The goal at NCI is to exploit such expression profile information for the advancement of improved detection, diagnosis, and treatment for the cancer patient. Researchers have access to all CGAP data and biological resources for human and mouse, including ESTs, gene expression patterns, SNPs, cluster assemblies, and cytogenetic information.

The CGAP web site provides a powerful set of interactive data-mining tools to explore these data, and the caBIO project was initially conceived as a programmatic interface to these tools and data. Accordingly, most of the data that are available from CGAP can also be accessed through the caBIO objects. Exceptions are those data sets having proprietary restrictions, such as the Mitelman Chromosome Aberration database. CGAP also provides access to lists of sequence-verified human and mouse cDNA IMAGE clones supplied by Invitrogen.

CMAP

Data Source

Description

CMAP

The goal of CMAP (Cancer Molecular Analysis Project) is to enable researchers to identify and evaluate molecular targets in cancer.

The CMAP Profile Query tool finds genes with the highest or lowest expression levels (using SAGE and microarray data) for a given tissue and histology. Selecting a gene from the resulting table then leads to a Gene Info page. This page provides information about cytogenetic location, chromosome aberrations, protein similarities, curated and computed orthologs, and sequence-verified as well as full-length MGC clones, along with links to various other databases.

CTEP

Data Source

Description

CTEP

CTEP (Cancer Therapy Evaluation Program) funds an extensive national program of basic and clinical research to evaluate new anti-cancer agents, with a particular emphasis on translational research to elucidate molecular targets and drug mechanisms. In response to this emergent need for translational research, there has been a groundswell of translational support tools defining controlled vocabularies and registered terminologies to enhance electronic data exchange in areas that have heretofore been relatively non-computational. The caBIO trials data are updated with new CTEP data on a quarterly basis, and many of the objects are designed to support translational research.

For example, a caBIO Target object represents a molecule of special diagnostic or therapeutic interest for cancer research, and an Anomaly object is an observed deviation in the structure or expression of a Target. An Agent is a drug or other intervention that is effective in the presence of one or more specific Targets. The ClinicalTrialProtocol object organizes administrative information pertaining to that protocol. Data from CTEP are used to populate Protocols, ProtocolAgents and ProtocolDiseases objects.
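
The relationships among Target, Anomaly and Agent described above can be pictured with a minimal Python sketch. The classes below are hypothetical stand-ins written only for illustration: they are not the caBIO API, and the attribute names, the helper agents_for_target and all sample values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """Illustrative stand-in: a molecule of diagnostic or therapeutic interest."""
    name: str
    anomalies: list = field(default_factory=list)


@dataclass
class Anomaly:
    """Illustrative stand-in: an observed deviation in a Target."""
    description: str
    target: Target

    def __post_init__(self):
        # Register the anomaly on its target so the link works in both directions.
        self.target.anomalies.append(self)


@dataclass
class Agent:
    """Illustrative stand-in: a drug or intervention tied to specific Targets."""
    name: str
    targets: list = field(default_factory=list)


def agents_for_target(agents, target_name):
    """Return the names of agents associated with the named target."""
    return [a.name for a in agents
            if any(t.name == target_name for t in a.targets)]


# Placeholder values, not real caBIO records.
target = Target("EXAMPLE_TARGET")
Anomaly("overexpression (placeholder)", target)
agents = [Agent("example-agent-1", [target]), Agent("example-agent-2")]
matching = agents_for_target(agents, "EXAMPLE_TARGET")
```

In this toy model an Agent simply holds a list of Targets, which is enough to answer "which agents act on this target" without any database behind it.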

Cancer Gene Index Project

Data Source

Description

Cancer Gene Index Project

The Cancer Gene Data Curation Pilot is an attempt to create a database of associations, derived from the biomedical literature, between genes and diseases and genes and drug compounds. The project involves a mixture of automatic text mining, semi-automatic verification, and manual validation/scoring of results. Data from this project is exposed through caBIO's Evidence, EvidenceCode, GeneDiseaseAssociation and GeneAgentAssociation objects.

Ensembl Compara

Data Source

Description

Ensembl Compara

DNA alignments across species can reveal the extent of evolutionary conservation in genetic material. The Ensembl Compara project runs many such alignment analyses and provides the results in various accessible formats. caBIO provides results from Compara's Pecan-based alignment analyses in particular, through the MultipleAlignment and related objects.

Entrez Gene

Data Source

Description

Entrez Gene

Entrez Gene contains curated sequence and descriptive information associated with a gene. Each entry includes information about the gene's nomenclature, aliases, sequence accession numbers, phenotypes, UniGene cluster IDs, OMIM IDs, gene homologies, associated diseases, map locations, and a list of related terms in the Gene Ontology Consortium's ontology. Sequence accessions include a subset of GenBank accessions for a gene, as well as the NCBI Reference Sequence. The LocusLink Identifier from Entrez Gene corresponding to a caBIO Gene is available in DatabaseCrossReference.

GAI

Data Source

Description

GAI

GAI (CGAP Genetic Annotation Initiative) is an NCI research program to explore and apply technology for identification and characterization of genetic variation in genes important in cancer. The GAI uses data mining to identify "candidate" variation sites from publicly available DNA sequences, as well as laboratory methods to search for variations in cancer-related genes. All GAI candidate, validated, and confirmed genetic variants are available directly from the GAI web site. All validated SNPs have also been submitted to the NCBI dbSNP database and are in turn available through the SNP object.

Gene Ontology Consortium

Data Source

Description

Gene Ontology Consortium

The Gene Ontology Consortium provides a controlled vocabulary for the description of molecular functions, biological processes, and cellular components of gene products. The terms provided by the consortium define the recognized attributes of gene products and facilitate uniform queries across collaborating databases.

In general, each gene is associated with one or more biological processes, and each of these processes may in turn be associated with many genes. In addition, the GO ontologies define many parent/child relationships among terms. For example, a branch of the ontology tree under biological process contains the term "cell cycle control," which in turn bifurcates into the "child" terms cell cycle arrest, cell cycle checkpoint, control of mitosis, etc.

caBIO does not extract ontology terms directly from the Gene Ontology Consortium but rather extracts those terms stored with the LocusLink entry for that gene.

This information is available via GoClosure, GoGenes, GeneOntology and GeneOntologyRelationship objects.
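
The parent/child structure of ontology terms described above can be sketched in a few lines of Python. This is an illustrative in-memory model only, not the caBIO API: the PARENTS mapping and the helper names are assumptions, and the direct link from "cell cycle control" to "biological process" is a simplification of the real ontology.

```python
# child term -> set of direct parent terms (tiny subset using the terms named above)
PARENTS = {
    "cell cycle arrest": {"cell cycle control"},
    "cell cycle checkpoint": {"cell cycle control"},
    "control of mitosis": {"cell cycle control"},
    # Simplifying assumption: treated here as a direct child of the root.
    "cell cycle control": {"biological process"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children(term, parents=PARENTS):
    """Return the direct child terms of `term`, sorted by name."""
    return sorted(child for child, ps in parents.items() if term in ps)
```

Computing the full set of ancestors for each term in advance is one way to answer "is this term under that branch" with a single lookup.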

HomoloGene

Data Source

Description

HomoloGene

HomoloGene is an NCBI resource for curated and calculated gene homologs. The caBIO data sources capture only the calculated homologs stored by HomoloGene. These calculated homologs are the result of nucleotide sequence comparisons performed between each pair of organisms represented in UniGene clusters. caBIO provides this information via HomologousAssociation and updates this data on a monthly basis.

HUGO Gene Nomenclature Committee

Data Source

Description

HUGO Gene Nomenclature Committee

HUGO (Human Genome Organization), an association of scientists involved in human genetics, approves a gene name, aliases and a symbol for every gene. Each symbol is unique and is assigned to only one gene. These data are used to populate the GeneAlias and Gene objects (the hugo symbol attribute) in caBIO, which in turn are used extensively in the ArrayAnnotationsAPI to search for caBIO Genes by their HUGO symbols or by their HUGO aliases.
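
The uniqueness rule for symbols and the looser rule for aliases can be illustrated with a small Python sketch. GeneIndex is a hypothetical class invented for this example and does not reproduce the ArrayAnnotationsAPI. The sketch also assumes that one alias may point to more than one gene, and all names are placeholders.

```python
class GeneIndex:
    """Toy index: a symbol identifies exactly one gene; an alias may match several."""

    def __init__(self):
        self._symbols = set()
        self._by_alias = {}

    def add(self, symbol, aliases=()):
        # Enforce the one-symbol-one-gene rule.
        if symbol in self._symbols:
            raise ValueError(f"symbol {symbol!r} is already assigned")
        self._symbols.add(symbol)
        for alias in aliases:
            self._by_alias.setdefault(alias, set()).add(symbol)

    def find(self, query):
        """Return sorted symbols that match `query` as a symbol or as an alias."""
        hits = set(self._by_alias.get(query, ()))
        if query in self._symbols:
            hits.add(query)
        return sorted(hits)


# Placeholder names, not real HUGO symbols.
index = GeneIndex()
index.add("SYMBOL1", aliases=["ALIAS_A", "ALIAS_B"])
index.add("SYMBOL2", aliases=["ALIAS_A"])
```

A search by symbol returns at most one gene, while a search by alias can return several, which is why the two lookups are kept in separate structures here.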

Illumina

Data Source

Description

Illumina

Illumina supplies one of the SNP arrays in caBIO. Data from Illumina are used to populate MicroArray and ExpressionArrayReporter, among other objects.

LocusLink/Entrez Gene

Data Source

Description

LocusLink/Entrez Gene

LocusLink contains curated sequence and descriptive information associated with a gene. Each entry includes information about the gene's nomenclature, aliases, sequence accession numbers, phenotypes, UniGene cluster IDs, OMIM IDs, gene homologies, associated diseases, map locations, and a list of related terms in the Gene Ontology Consortium's ontology. Sequence accessions include a subset of GenBank accessions for a locus, as well as the NCBI Reference Sequence. This LocusLink Identifier corresponding to a caBIO Gene is available in DatabaseCrossReference.

MapView

Data Source

Description

MapView

Entrez Genomes presents a unified graphical view of maps (genetic and physical) and sequence data for a selected organism. The Entrez Map Viewer is a software component of Entrez Genomes which provides an organism's complete genome, integrated maps (when available) for each chromosome, and sequence data for a region of interest. Data from MapView is used to populate GenePhysicalLocation and MarkerPhysicalLocation objects.

Pathway Interaction Database (PID)

Data Source

Description

PID

The Pathway Interaction Database is a highly-structured, curated collection of information about known biomolecular interactions and key cellular processes assembled into signaling pathways. It is a collaborative project between the US National Cancer Institute (NCI) and Nature Publishing Group (NPG), and is an open access online resource.

dbSNP

Data Source

Description

dbSNP

In collaboration with the National Human Genome Research Institute, the NCBI has established the dbSNP database to serve as a central repository for both single base nucleotide substitutions and short deletion and insertion polymorphisms. Once discovered, these polymorphisms could be used by additional laboratories, using the sequence information around the polymorphism and the specific experimental conditions. (Note that dbSNP takes the looser 'variation' definition for SNPs, so there is no requirement or assumption about minimum allele frequency.) The data from dbSNP is updated approximately every 3-4 months. Relevant information is available through SNP, Provenance, Source, URLSourceReference, SourceReference, PhysicalLocation, Location, SNPPhysicalLocation, SNPCytogeneticLocation, GeneRelativeLocation, and MarkerRelativeLocation objects.

SNP Consortium

Data Source

Description

SNP Consortium

The SNP Consortium Ltd. is a non-profit foundation organized for providing public genomic data. Its mission is to develop up to 300,000 SNPs distributed evenly throughout the human genome and to make the information related to these SNPs available to the public without intellectual property restrictions.

The TSC IDs corresponding to a SNP from dbSNP are available through DatabaseCrossReference.

UCSC

Data Source

Description

UCSC

UCSC (University of California, Santa Cruz Distributed Annotation System) provides the chromosomal start and end positions of mRNA, EST, and cytoband sequences. The positions of cytogenetic bands within a chromosome, represented by the caBIO Cytoband object, are also obtained from UCSC.

These data are used to populate the PhysicalLocation, Cytoband, CytogeneticLocation and Location objects with the locations of ESTs, mRNAs and cytobands.

The ESTs and mRNAs in PhysicalLocation and Location are linked with the corresponding sequence identifiers from NucleicAcidSequence.

UniGene

Data Source

Description

UniGene

UniGene provides a nonredundant partitioning of the genetic sequences contained in GenBank into gene clusters. Each cluster has a unique UniGene ID and a list of the mRNA and EST sequences that are included in that cluster. Related information stored with the cluster includes tissue types in which the gene has been expressed, mapping information, and the associated LocusLink, OMIM, and HomoloGene IDs. Through DatabaseCrossReference, these IDs provide access to related information in those NCBI databases as well.

Because the information in UniGene is centered around genes, access to UniGene is provided via the caBIO Gene objects. Specifically, the method getClusterId associated with a Gene object can be used to fetch the gene's UniGene ID. The corresponding IDs to cross-reference these genes into the NCBI OMIM and LocusLink databanks and the Enzyme Commission, Ensembl and RefSeq databases can be obtained from DatabaseCrossReference using the getDatabaseCrossReferenceCollection method. While there is no explicit caBIO object corresponding to a UniGene cluster, all of the information associated with the cluster is available directly via the caBIO Gene object's methods.
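
The access pattern described above can be sketched as follows. This is a hedged Python illustration, not the caBIO API: the two classes are hypothetical stand-ins that only mirror the method names getClusterId and getDatabaseCrossReferenceCollection mentioned in the text. The attribute names, the helper ids_for_source and every value are placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatabaseCrossReference:
    """Stand-in cross-reference; attribute names are assumptions."""
    data_source_name: str
    identifier: str


@dataclass
class Gene:
    """Stand-in gene that carries its UniGene cluster ID and cross-references."""
    symbol: str
    cluster_id: int
    cross_references: list = field(default_factory=list)

    def getClusterId(self):
        # Mirrors the method named in the text: returns the UniGene cluster ID.
        return self.cluster_id

    def getDatabaseCrossReferenceCollection(self):
        # Return a copy so callers cannot modify the gene's own list.
        return list(self.cross_references)


def ids_for_source(gene, source_name):
    """Collect the identifiers a gene has in one external database."""
    return [xref.identifier
            for xref in gene.getDatabaseCrossReferenceCollection()
            if xref.data_source_name == source_name]


# Placeholder identifiers, not real database entries.
gene = Gene("EXAMPLE_GENE", 12345, [
    DatabaseCrossReference("OMIM", "000000"),
    DatabaseCrossReference("LocusLink", "0000"),
])
```

Filtering the cross-reference collection by source name is one straightforward way to pull out, for example, only the OMIM identifiers for a gene.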

The sequences are available via NucleicAcidSequence and the associated aliases are available in GeneAlias. The corresponding clone library and its location information are exposed via Clone and CloneRelativeLocation.

Appropriate 'provenance' information is available in Provenance, Source, URLSourceReference and SourceReference objects. Markers associated with a Gene are available through the Marker object whereas its chromosomal and cytogenetic start-stops are available through GenePhysicalLocation and GeneCytogeneticLocation objects.

Associated histopathological and pathway information are available through Histopathology and Pathway respectively.

UniGene data are updated on a monthly basis.

UniProt PIR

Data Source

Description

UniProt PIR

The Universal Protein Resource (UniProt) is a comprehensive annotated protein sequence database and a central repository of protein sequence and function, created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. The UniProt Knowledgebase provides access to extensive curated protein information, including the amino acid sequence, protein name or description, taxonomic data and protein aliases.

caBIO exposes information from the Swiss-Prot databanks through ProteinSequence and ProteinAlias objects.

Mappings to RefSeq IDs are available through DatabaseCrossReference.

Provenance-related information is available through Provenance, Source, SourceReference and URLSourceReference objects.

Protein domain information from InterPro is exposed through the ProteinDomain object.

UniSTS

Data Source

Description

UniSTS

The UniSTS database at the National Center for Biotechnology Information provides a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. It is used in eGenome as a source of STSs and to collect and manage element name data. In caBIO, UniSTS is used to populate Marker and MarkerAlias objects.
