This chapter describes how to use caIntegrator tools to analyze data in subject annotation or genomic studies that have been deployed in caIntegrator.

Topics in this chapter include the following:

Data Analysis Overview

Once a study has been deployed, you can analyze the data using caIntegrator analysis tools.

You can verify that the study has "Deployed" status by selecting the study name in the My Studies dropdown selector. After selecting the study name, click Home in the left sidebar of the caIntegrator menu. A study summary should appear, including a status field. If the status is not Deployed, or if the study summary does not appear, the study is not deployed and is not available for analysis.

If the study is ready for analysis, you will see an Analysis Tools menu in the left sidebar with the following options:

After defining or running the analysis on selected data sets, analysis results display on the same page, allowing you to review the analysis method parameters you defined.

Creating Kaplan-Meier Plots

This topic opens from any of the three K-M plot tabs. For specific details about working with these tabs, see the following topics:

The Kaplan-Meier (K-M) method analyzes comparative groups of patients or samples. In caIntegrator, the K-M method can compare survival statistics among comparative groups. You can configure the survival data in the application. For example, you might identify a group of patients with a smoking history and compare their survival rates with those of a group of non-smoking patients, or compare the survival data for two groups of patients with a specific disease type, based on Karnofsky scores. You could also compare groups of patients with varying gene expression levels. You can also identify data sets using the query feature in the application, save the queries, and then configure the K-M plot to compare the groups identified by the queries.

The key is to first identify subsets of patients or samples that meet criteria you want to establish, thus filtering the data you want to compare. Next, generate a K-M plot based on their survival probability as a function of time. Survival differences are analyzed by the log-rank test.
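The survival probability behind a K-M plot can be sketched with the standard Kaplan-Meier product-limit estimate. The example below is only an illustration, not caIntegrator's own code; the function name `kaplan_meier` and the sample data are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time for each subject
    events -- 1 if the event (death) was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)          # subjects leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]         # deaths and censored subjects both leave
    return curve


# Hypothetical group: five subjects, two of them censored.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

Each group identified for comparison would get its own curve, and the curves are then drawn on the same axes.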

caIntegrator calculates the log-rank p-value for the data, indicating the significance of the difference in survival between any two groups of samples. The log-rank p-value is calculated using the Mantel-Haenszel method. The p-values are recalculated every time a new plot is generated.
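As a rough sketch of how a two-group log-rank p-value can be computed, the following example uses the textbook Mantel-Haenszel form of the statistic (observed minus expected deaths, with the hypergeometric variance). It is an assumption that this matches caIntegrator's calculation in detail; the function name `logrank_p_value` and the sample data are hypothetical.

```python
import math


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical groups with clearly different survival times (all events observed).
p = logrank_p_value([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
```

A small p-value suggests that the difference in survival between the two groups is unlikely to be due to chance alone.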

To perform a K-M plot analysis, survival data must have been identified for the study you want to analyze, or an Annotation Field Descriptor such as DAYSTODEATH must have been set to Data Type 'numeric'.

K-M Plot for Annotations

The groups identified for this K-M plot generation are based on annotations.

caIntegrator generates the plot, which then displays below the plot criteria.
“A K-M plot generated for groups based on annotations”

K-M Plot for Gene Expression

caIntegrator allows you to compare expression levels for one given gene in different representative groups. The relative expression level is referred to as "fold change". Fold change is the ratio of the measured gene expression value in an experimental sample, as determined by a reporter, to a reference value calculated for that reporter against all control samples. The reference value is calculated by taking the mean of the log2 of the expression values for all control samples for the reporter in question. The log2 mean value (n) is then converted back to a comparable expression signal by raising 2 to the power n.
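The fold change calculation described above can be sketched as follows. The function names `reference_value` and `fold_change` and the sample values are hypothetical and are used only for illustration.

```python
import math


def reference_value(control_values):
    """Mean of log2(control values) for one reporter, converted back with 2 ** n."""
    n = sum(math.log2(v) for v in control_values) / len(control_values)
    return 2 ** n


def fold_change(sample_value, control_values):
    """Ratio of a sample's expression value to the control reference value."""
    return sample_value / reference_value(control_values)


# Hypothetical reporter: control samples at 2 and 8 give a log2 mean of 2,
# so the reference value is 4 and a sample value of 8 is a fold change of 2.
fc = fold_change(8.0, [2.0, 8.0])
```

Averaging in log2 space and converting back is equivalent to taking the geometric mean of the control values.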

To create a K-M plot illustrating gene expression values, follow these steps:

caIntegrator provides three methods for obtaining gene symbols for calculating a K-M plot for gene expression. For more information, see Choosing Genes.

If the study has more than one platform associated with it, the platform is implicitly selected when you select the control set. Control sets consist of samples from only one platform.

Click the Create Plot button. caIntegrator generates the plot, which then displays below the plot criteria.

K-M Plot for Queries and Saved Lists

You can identify data sets using the query feature in the application. You can manipulate the queries to find the groups you want to compare, save the queries, then configure the K-M to compare the query groups. This is one method of limiting the data considered in the K-M plot calculation.

K-M Plot for Queries Display
After you have defined the criteria, caIntegrator generates the plot, which then displays below the plot criteria.
K-M Plot comparing statistics between subjects in two queries

Creating Gene Expression Plots

Gene expression plots compare signal values from reporters or genes. This statistical tool allows you to compare values for multiple genes at a time, and the comparison is not limited to two sets of data. It also allows you to compare expression levels for selected genes against expression levels for a set of control samples designated at the time of study definition.

caIntegrator provides three ways to generate meaningful gene expression plots, indicated by tabs on the page. The tabs are independent of each other and allow you to select the genes, reporters and sample groups to be analyzed on the plot.

Gene Expression Value Plot for Annotation

To generate a gene expression plot, follow these steps:

caIntegrator provides three methods for obtaining gene symbols for calculating a gene expression plot. For more information, see Choosing Genes.

Gene Expression Plot for Annotation Display
After you have defined the criteria, caIntegrator generates the plot, which then displays below the plot criteria.
Legends below the plot indicate the plot input. By default, the plot shows the mean of the data. The following figure displays a plot with gene expression median calculation summaries.
Gene expression plot based on selected annotations

Gene Expression Value Plot for Genomic Queries

Data to be analyzed on this tab must have been saved as a genomic query.

To generate a gene expression plot using a genomic query, follow these steps:

Gene Expression Value Plot for Annotation and Saved List Queries

Data to be analyzed on this tab must have been saved as a subject annotation query, and the query must identify genomic data. For the genomic data, you must identify the genes whose expression values are used to calculate the plot.

To generate a gene expression plot using an annotation query, follow these steps:

caIntegrator provides three methods for obtaining gene symbols for calculating a gene expression plot. For more information, see Choosing Genes.

Gene Expression Plot for Saved Queries Display

After you have defined the criteria, caIntegrator generates the plot, which displays in bar graph format below the plot criteria.
By default, caIntegrator displays the mean of the data below the plot criteria. Legends below the plot indicate the plot input.
“Gene expression plot based on annotation queries gene expression values”

Understanding a Gene Expression Plot

Above the plot, you can select various plot types. When you do so, the plot is recalculated. Although all of the plots in this section appear similar, note the differences in the calculation results and in the Y-axis legends of each plot.
When you perform a Gene Expression simple search, the Gene Expression Plot appears by default.
“Gene expression plot calculating the mean”

The Gene Expression Plot displays mean expression intensity (geometric mean) versus groups.
“Gene expression plot calculating the median”

The log2 intensity Gene Expression Plot, shown in the following figure, displays average expression intensities for the gene of interest based on Affymetrix GeneChip arrays (U133 Plus 2.0 arrays).
“Gene expression plot displaying log2 intensity values”

The box and whisker log2 expression intensity plot displays a box plot. Example uses of box and whisker plots include the following:

In descriptive statistics, a box plot or boxplot, also known as a box-and-whisker diagram or plot, is a convenient way of graphically depicting groups of numerical data through their five-number summaries (the smallest observation excluding outliers, lower quartile [Q1], median [Q2], upper quartile [Q3], and largest observation excluding outliers).

The box is defined by Q1 and Q3 with a line in the middle for Q2. The interquartile range, or IQR, is defined as Q3 - Q1. The lines above and below the box, or 'whiskers', are at the largest and smallest non-outliers. Outliers are defined as values that are more than 1.5 * IQR above Q3 or more than 1.5 * IQR below Q1. Outliers, if present, are shown as open circles.
“Box and whisker plot showing outliers”

Boxplots can be useful to display differences between populations without making any assumptions of the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data.
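The box plot quantities described above (quartiles, IQR, whiskers and outliers) can be sketched in a few lines. Quartile conventions differ between tools, so the values that caIntegrator plots may differ slightly; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the function name `box_plot_summary` and the data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return the five-number summary (excluding outliers) and the outliers."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "lower_whisker": inliers[0],    # smallest non-outlier
        "q1": q1,
        "median": q2,
        "q3": q3,
        "upper_whisker": inliers[-1],   # largest non-outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical log2 intensities with one extreme value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this example the value 100 lies more than 1.5 * IQR above Q3, so it is reported as an outlier and the upper whisker stops at 9.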

Analyzing Data with GenePattern

GenePattern is an application developed at the Broad Institute that enables researchers to access various methods to analyze genomic data. caIntegrator provides an express link to GenePattern where you can analyze data in any caIntegrator study.

This section describes how to connect to GenePattern from caIntegrator. Specifics for launching GenePattern tools from caIntegrator are included as well, but you may want to refer to additional GenePattern documentation.

You have two options for using GenePattern from caIntegrator:

The GenePattern feature in caIntegrator currently supports three analyses on the grid: Comparative Marker Selection (CMS), Principal Component Analysis (PCA) and GISTIC-supported analysis.

If you are using the web interface to access GenePattern (option #1 listed above), then you can run other GenePattern tools in addition to CMS, PCA and GISTIC.

GenePattern Modules

To launch the analyses described in this section, you must have a registered GenePattern account.

To configure the link for accessing GenePattern from caIntegrator, open the appropriate page.

The CLS file format defines phenotype (class or template) labels and associates each sample in the expression data with a label. It uses spaces or tabs to separate the fields. The CLS file format differs somewhat depending on whether you are defining categorical or continuous phenotypes:

Once the analysis is launched, caIntegrator returns to the GenePattern Analysis Status page. There you can monitor the status of your current study, which is listed in the Analysis Method column, and view information about other GenePattern analyses that have been run on this study.
“GenePattern Analysis Status page displays a list of GenePattern analyses performed on the current study”

If you choose to access GenePattern in this way, you can continue to use GenePattern tools from within that application. See GenePattern user documentation for more information.

If you run these analyses within GenePattern itself, you may be able to view results in the GenePattern visualization module. Click View Results on the row where the results are listed. If you run them on the grid from caIntegrator, your results will be available only in spreadsheet and XML format.

You can run GenePattern analyses for Comparative Marker Selection, Principal Component Analysis and GISTIC-based analysis on the grid if you choose.

Comparative Marker Selection (CMS) Analysis

The Comparative Marker Selection (CMS) module implements several methods to look for expression values that correlate with the differences between classes of samples. Given two classes of samples, CMS finds expression values that correlate with the difference between those two classes. If there are more than two classes, CMS can perform one-vs-all or all-pairs comparisons, depending on which option is chosen.
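To make the difference between the two multi-class options concrete, the following sketch only enumerates which class comparisons each option implies. It does not perform the marker selection statistics, and it is not GenePattern code; the function names and class labels are hypothetical.

```python
from itertools import combinations


def one_vs_all(classes):
    """Each class compared against all of the remaining classes together."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def all_pairs(classes):
    """Every unordered pair of classes compared with each other."""
    return list(combinations(classes, 2))


# Hypothetical phenotype classes.
classes = ["A", "B", "C"]
ova = one_vs_all(classes)    # A vs (B, C), B vs (A, C), C vs (A, B)
pairs = all_pairs(classes)   # A vs B, A vs C, B vs C
```

With three classes both options yield three comparisons, but with more classes the number of all-pairs comparisons grows faster than the number of one-vs-all comparisons.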

For more information, see the GenePattern website.

To perform a CMS analysis, follow these steps:

caIntegrator takes you to the JobStatus/Launch page, where you will see the job and its status in the Status column of the list.
“The progress of a GenePattern analysis that has been launched displays in the Status column of the page”

Principal Component Analysis (PCA)

Principal Component Analysis is typically used to transform a collection of correlated variables into a smaller number of uncorrelated variables, or components. Those components are typically sorted so that the first one captures most of the underlying variability and each succeeding component captures as much of the remaining variability as possible.
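As a minimal illustration of the idea, the sketch below finds only the first principal component of a small data set and the share of total variance it captures. It uses power iteration on the covariance matrix as a simple stand-in technique; it is not the algorithm used by the GenePattern PCA module, and the function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the first component.

    rows -- list of samples, each a sequence of numeric variables.
    Assumes the data are not all identical (non-zero variance).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p
    for _ in range(iterations):
        # Repeated multiplication by the covariance matrix converges to
        # the direction of greatest variance.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total_variance


# Hypothetical data: two perfectly correlated variables.
direction, explained = first_principal_component([(1, 2), (2, 4), (3, 6), (4, 8)])
```

Because the two variables in this example are perfectly correlated, a single component captures essentially all of the variability.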

You can configure GenePattern grid parameters for preprocessing the dataset in addition to PCA module parameters. For more information, see the GenePattern website.

To perform a PCA analysis, follow these steps:

GISTIC-Supported Analysis

The GISTIC test option displays only if the study contains copy number or SNP data.

The GISTIC Module is a GenePattern tool that identifies regions of the genome that are significantly amplified or deleted across a set of samples.

To perform a GISTIC-supported analysis, follow these steps:

Viewing Data with the Integrative Genomics Viewer

Once you have run a query for gene expression or copy number data, you can view results in the Integrative Genomics Viewer (IGV).

The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations.

For more information about the Integrative Genomics Viewer, see the IGV home page. The IGV and the NCI Heat Map Viewer both require you to install a version of Java containing Java Web Start. For more information, see Java for IGV and Heat Map Viewer.

There are two ways to integrate caIntegrator with the IGV. To configure the connection to IGV, follow one of these methods.
Method 1

This opens the genome site at UCSC, where you can learn more about the gene.
“Example of the kind of metadata you can learn about a gene at the UCSC genome website”

For a user guide for IGV, see the IGV website.

Method 2

Viewing Data with Heat Map Viewer

Once you have run a query for gene expression or copy number data, you can view results in the Heat Map Viewer (HMV).

For more information about the Heat Map Viewer, see the HMV home page. The IGV and the NCI Heat Map Viewer both require you to install a version of Java containing Java Web Start. For more information, see Java for IGV and Heat Map Viewer.

There are two ways to integrate caIntegrator with the Heat Map Viewer. To configure the connection, follow one of these methods.
Method 1

For Heat Map Viewer documentation, see the Heat Map Viewer website.

Method 2

Java for IGV and Heat Map Viewer

To use the IGV and the NCI Heat Map Viewer, you must install a version of Java containing Java Web Start. Install a recent version of the Java Development Kit (JDK 1.5.0, also known as JDK 5.0, or newer) or the Java Runtime Environment (JRE 1.5.0, also known as JRE 5.0, or newer). The easiest option is to install JRE 5.0.

Without Java Web Start, when you click Launch Integrative Genomics Viewer or Launch Heat Map Viewer, a dialog box displays in your browser giving you the option to save or open with (IGV) or (HMV). Clicking the Open option starts the Java Web Start Launcher (the default), which installs the Java application so that you can view the files.

The first time you launch the IGV or HMV with Java properly installed, regardless of browser type, a warning may appear stating that "the digital signature cannot be verified". Click Run to proceed with opening the viewer.