NIH | National Cancer Institute | NCI Wiki  


Problem: How To Add Data to an Existing Study

Topic: caIntegrator Usage

...

  1. Log into caIntegrator via the application's main Web page.

[Figure: illustration of step]
First log into caIntegrator via the application's main Web page. In this example, I've logged in with my username (hd2266).

  1. Each installation of caIntegrator can host several studies. When you first log in, you will be taken to the home page of the default study, which in this case is entitled 'jagla-00034'. Since this is not the study we want to add data to, you will want to bring up a list of available studies by clicking on the 'Manage Studies' link under the 'STUDY MANAGEMENT' menu in the navigation panel to the left.

[Figure: illustration of step]
Once you log in, you are taken to the home page of the default study, which in this case is 'jagla-00034'. The study we want to add data to is 'Demo Study for ICR Folks', which you can access by clicking on the 'Manage Studies' link (highlighted in red).

On the Manage Studies page, find the study entitled 'Demo Study for ICR Folks' in the table of studies, then click on the 'Edit' link under the Action column at the far right of the table.
[Figure: illustration of step]
You can edit the study entitled 'Demo Study for ICR Folks', which is at the top of the study list, by clicking on the 'Edit' link (highlighted in red).

...

Note that this study already has some subject annotation and genomic data loaded. The annotation data is in the form of the CSV file 'subject_annotation_DC_Lung_Study_111210.CSV', while the genomic data is in the form of a link to the address of the caArray server which hosts the data (array.nci.nih.gov), as well as an experiment identifier (jacob-00182) which references the particular experiment containing the data of interest. Later in this tutorial, we will examine in depth how to load more of this data into the study.

[Figure: illustration of step]
This study already has subject annotation and genomic data loaded; they are listed beneath their respective headings, which are highlighted in red. Later in this tutorial, we'll learn how to load more data into this study.

...

  1. Now we're ready to load additional subject annotation data into the 'Demo Study for ICR Folks'. As mentioned before, you'll need the data in the form of a CSV file containing at least one field with a unique ID for each subject in the study. The CSV file we'll use in this tutorial is called 'subject_annotations_tutorial.CSV'. A partial screenshot of the file appears below as viewed in a Microsoft Excel 2007 window.

[Figure: illustration of step]

This data came from a fictional multi-site study that compared gene expression between lung adenocarcinoma patients and healthy controls. The nature of the data itself is irrelevant to our purpose here. The relevant aspect is that the data is categorized into five fields, which are represented by columns in the spreadsheet.
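
Because each subject needs a unique ID, it can help to check the file before uploading it. The following is a minimal sketch, not part of caIntegrator; the column name 'Subject ID' and the sample rows are hypothetical, since the tutorial does not list the actual column headings.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(csv_text, id_column):
    """Return the subject IDs that occur more than once in id_column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column].strip() for row in reader)
    return sorted(sid for sid, n in counts.items() if n > 1)


# Hypothetical file content; replace with the text of your own CSV file.
sample = """Subject ID,Site,GENDER
3,MI,Male
5,MI,Female
3,DC,Male
"""

print(find_duplicate_ids(sample, "Subject ID"))
```

An empty list means every subject ID in the file is unique.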

...

  1. When you click on the 'Choose File' button, you'll be prompted for the location of the data file with an Open dialog. Locate the CSV file containing your subject data, click on it, then click on the 'Open' button.

[Figure: illustration of step]
In this example, we click on the 'subject_annotations_tutorial.CSV' file (highlighted in red), then click on the Open button. Your own annotations file will be named differently.

  1. Once you open your annotations file, you'll be taken back to the Edit Study page, where you can click on the 'Upload Now' button at the bottom of the area to load the file into the study.

[Figure: illustration of step]
Click on the 'Upload Now' button (highlighted in red) to load the subject data into the study.

  1. Once you've uploaded the data, you'll encounter another page prompting you to define the various fields for your subject data. Since these fields were already defined when the study was created, we don't need to modify them. Just click on the 'Save' button at the bottom of the page to continue.

[Figure: illustration of step]
Click on the Save button (highlighted in red) to confirm your annotation field definitions.

  1. Back on the 'Edit Study' page, the newly uploaded source will now appear in the table beneath the 'Subject Annotation Data Sources' heading. Notice that the status of this source appears as 'Not Loaded' under the Status column. To change this, click on the 'Load Subject Annotation Source' button under the Action column.

[Figure: illustration of step]
The newly uploaded source now appears in the second row (highlighted in red) of the Data Sources table. Click on the 'Load Subject Annotation Source' button under the Action column to load the source.

...

To understand why this error is occurring, let's examine the contents of the new annotation file we just tried to load. A partial screenshot of the file appears below as viewed in a Microsoft Excel 2007 window.
[Figure: illustration of step]

Notice that this file contains not only new subjects (IDs 6000 to 6002), but also some subjects (IDs 3, 5, and 10) that already appear in the previously loaded file 'subject_annotation_DC_Lung_Study_111210.CSV'. In addition, the values in the 'Stratagene' field for these subjects differ between the new file and the original file. This explains the 'Value Already Loaded' error message that occurs when we attempt to load the file: the file we're trying to load contains subjects that duplicate those in previously loaded files.
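
One way to catch this problem before uploading is to compare the subject IDs in the new file against those in the file that is already loaded. The sketch below is an illustration only and is not how caIntegrator itself performs the check; the column name 'Subject ID' and the 'Stratagene' values are hypothetical, while the IDs mirror those described above.

```python
import csv
import io


def subject_ids(csv_text, id_column):
    """Collect the set of subject IDs found in id_column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def overlapping_ids(loaded_csv, new_csv, id_column):
    """Return the IDs that appear in both the loaded file and the new file."""
    return subject_ids(loaded_csv, id_column) & subject_ids(new_csv, id_column)


# Hypothetical excerpts of the two annotation files.
already_loaded = """Subject ID,Stratagene
3,1.2
5,0.8
10,2.4
12,1.9
"""
new_file = """Subject ID,Stratagene
3,1.5
5,0.9
10,2.0
6000,1.1
6001,0.7
6002,1.3
"""

print(overlapping_ids(already_loaded, new_file, "Subject ID"))
```

Any ID that this prints would need to be removed from the new file before it can be loaded without the error.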

...

  1. We can't query the study unless it's already been deployed. To check whether this is the case, scroll all the way down to the bottom of the 'Edit Study' page, where you'll see a row of three buttons. If the study has been deployed, as is the case in our example, the left button labeled 'Deploy Study' will be grayed out and you will not be able to click on it. If, however, the study hasn't been deployed, the button will appear normally, and you can click on it to deploy the study.

[Figure: illustration of step]
The bottom of the 'Edit Study' page shows the 'Deploy Study' button (highlighted in red). In this example, the study has already been deployed so this button is grayed out. If your study hasn't yet been deployed, the button will appear normally, and you can click on it to deploy the study.

  1. Now that we've loaded our clinical data into the study, let's query it. To get started, click on the link 'Search Demo Study for ICR Folks' under the menu 'DEMO STUDY FOR ICR FOLKS' in the navigation panel to the left.

[Figure: illustration of step]
Click on the link 'Search Demo Study for ICR Folks' (highlighted in red) to perform a query on the annotation data you just uploaded.

...

As an example, let's say we want to query the data for all male subjects located at the 'MI' study site. In this case, our two query criteria are 'Site' and 'Gender', and their respective query values are 'MI' and 'Male'. We can formulate the query by first clicking on the 'Add' button to the right of the drop-down list under the 'Define Query Criteria' heading.

[Figure: illustration of step]
To begin formulating your query, click on the 'Add' button (highlighted in red).

  1. Next, click on the drop-down list that appears below the 'Add' button. The list contains three items: 'Site', 'Stratagene', and 'Survival in Months'. Click on 'Site'.

[Figure: illustration of step]
Click on 'Site' (highlighted in red) from the Annotations drop-down list to select it as a query criterion.

  1. Once you click on 'Site', another two drop-down lists will appear to the right of the original one. Click on the third (rightmost) list to bring up the different values for Site and click on 'MI' from this list.

[Figure: illustration of step]
Click on 'MI' (highlighted in red) in the drop-down list of values for the Site field.

...

To add Gender as a second query criterion, go back to the original drop-down list (the one at the top), click on it again, click on 'Demographic' in the list, and then click on the 'Add' button to the right of the list.

[Figure: illustration of step]
Select Demographic (highlighted in red) from the drop-down list, then click on the Add button (also highlighted in red).

  1. Next, a new drop-down list labeled 'Demographic' will appear below the one labeled 'Annotations – Default'. Click on this new list, then click on 'GENDER'.

[Figure: illustration of step]
Click on 'GENDER' in the 'Demographic' drop-down list.

  1. Once you click on 'GENDER', another two drop-down lists will appear to the right of the original one. Click on the third (rightmost) list to bring up the different values for Gender and click on 'Male' from this list.

[Figure: illustration of step]
Click on 'Male' (highlighted in red) in the third (rightmost) drop-down list labeled 'Demographic'.

  1. Now that we've fully defined our query, we're ready to run it. Click on the 'Run Query' button at the bottom of the page to see the results.

[Figure: illustration of step]
Click on the 'Run Query' button (highlighted in red) to see results.
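
The query defined in the steps above combines two criteria, so a subject is returned only if it matches both. As a rough offline illustration of that logic (not caIntegrator's implementation), the same filter can be sketched against a CSV file. The rows and the 'Subject ID' column name below are hypothetical.

```python
import csv
import io


def run_query(csv_text, criteria):
    """Return the rows whose fields match every field/value pair in criteria."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical annotation rows.
annotations = """Subject ID,Site,GENDER
101,MI,Male
102,MI,Female
103,DC,Male
104,MI,Male
"""

matches = run_query(annotations, {"Site": "MI", "GENDER": "Male"})
print([row["Subject ID"] for row in matches])
```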

...

You can sort these results in numerical order of subject ID by clicking on the 'Subject ID' heading above the right table column.

[Figure: illustration of step]
You can sort query results by clicking on the Subject ID heading (highlighted in red) above the right column.

  1. You can customize the display of query results by clicking on the 'Results Type' tab at the top of the page and selecting additional fields to be displayed via the checklists for each annotation set. In this example, we checked off 'Stratagene' and 'Survival in Months' in the default annotation checklist.

[Figure: illustration of step]
You can select additional fields (highlighted in red) to be displayed in the query results by selecting them from the checklists in the 'Results Type' tab, then clicking on the 'Run Query' button (also highlighted in red).

If you now click on the 'Run Query' button at the bottom right of the page, the results will be displayed again under the 'Query Results' tab, but this time with the additional columns Stratagene and Survival in Months, which correspond to the new fields we selected.

[Figure: illustration of step]
The updated query results include two additional columns (highlighted in red) which correspond to the two additional fields we selected under the 'Results Type' tab.

  1. To save this query in caIntegrator for future reference, click on the 'Save query as..' tab at the top of the page, enter a name and description for the query in the respective fields, and click on the 'Save Query' button at the bottom.

[Figure: illustration of step]
You can save the query by clicking on the 'Save query as..' tab, entering a query name and description, and clicking on the 'Save Query' button (highlighted in red).

  1. Once the query is saved, the Search page will reload and the Study Data menu in the left navigation panel will expand to show the newly saved query 'Tutorial' under the 'My Queries' heading. You can click on the magnifying glass icon to the left of the Tutorial link to bring up the query results again, or on the pencil icon to edit the query criteria.

[Figure: illustration of step]
The 'Tutorial' query (highlighted in red) is now saved under the 'STUDY DATA' menu in the left navigation panel and can be accessed at any time.

...

  1. To begin, navigate back to the 'Edit study' page for the 'Demo Study for ICR Folks'. If you've forgotten how to do this, refer to step 2 in this tutorial.
  2. On the 'Edit study' page, scroll down to the 'Genomic Data Sources' heading. The table below it shows that one source has already been loaded and mapped. To add another, start by clicking the 'Add New' button to the right of the heading.

[Figure: illustration of step]
Click on the 'Add New' button (highlighted in red) to begin adding a new genomic data source.

...

If your server hostname or any of the other values for your data source differ from the default values, enter them into their respective fields, then click on the 'Save' button at the bottom of the page. (Remember that, if your study is private, you must enter the login credentials into the 'Username' and 'Password' fields.)

[Figure: illustration of step]
Enter the values for your data source if they differ from the default values, then click on the 'Save' button (highlighted in red). Don't forget to enter your caArray experiment ID – the ID for our example source is 'jacob-00182'.

  1. Back on the 'Edit Study' page, a new row has appeared in the 'Genomic Data Sources' table which corresponds to the new data source we just added. Our next step is to map the samples in this source to the subjects in our annotation source. To begin, click on the 'Map Samples' button under the 'Action' column at the right of the table.

[Figure: illustration of step]
The newly added row (highlighted in red) in the Genomic Data Sources table corresponds to the new genomic data source we added in step 24. Click on the 'Map Samples' button (highlighted in blue) to map the samples to subjects from the annotation source we added in steps 3 to 8.

  1. The 'Edit Sample Mappings' page displays a list of unmapped samples, followed by another list mapping sample IDs to subject IDs. As you can see, the mapping list is empty, which means that none of the samples in this source have been mapped yet! The list of unmapped samples appears under the heading 'Unmapped Samples' and subheading 'Sample Name'. The numbers in this list represent the sample IDs of the unmapped samples.
[Figure: illustration of step]

The 'Edit Sample Mappings' page shows a list of IDs for unmapped samples (highlighted in red).

Your mapping CSV file must map the subject IDs in your annotations to the sample IDs in the unmapped samples list. A screenshot of the mapping file used in this tutorial, taken from a Microsoft Excel 2007 window, is shown below. The file is a table of two columns with no headings; the first column contains IDs of the subjects from the annotation source and the second column contains IDs from the unmapped samples list. Each subject in the left column corresponds to the sample in the right column. Note that the file doesn't map every single sample ID from the data source.

[Figure: illustration of step]
This CSV file maps the subject IDs from our annotation source (left column) to the sample IDs in our genomic source (right column).
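
If you keep your subject-to-sample pairs in another form, a mapping file in the layout described above (two columns, no header row, subject ID first, sample ID second) can be generated with a short script. This is only a sketch; the ID pairs shown are hypothetical placeholders, and the function returns the file content as text rather than writing to disk.

```python
import csv
import io


def build_mapping_csv(pairs):
    """Return CSV text with one 'subject ID,sample ID' row per pair, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for subject_id, sample_id in pairs:
        writer.writerow([subject_id, sample_id])
    return buffer.getvalue()


# Hypothetical (subject ID, sample ID) pairs.
pairs = [("5085", "191"), ("6000", "192")]
mapping_text = build_mapping_csv(pairs)
print(mapping_text)
```

Saving `mapping_text` to a file with a .CSV extension gives a file in the same two-column layout.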

To add your mapping CSV file to the study, click on the 'Choose File' button next to the 'Subject to Sample Mapping File' label.

[Figure: illustration of step]
Click on the 'Choose File' button (highlighted in red) to choose a mapping file to open.

In the Open dialog that follows, find your mapping file, click on it, and then click on the 'Open' button. (In our example, the mapping file is named 'mapping_file_tutorial.CSV'.)
[Figure: illustration of step]
To open your mapping file, click on the 'mapping_file_tutorial.CSV' file (highlighted in red), then click on the 'Open' button (highlighted in blue).

...

Since this information may be considered important to your study, we need a way of distinguishing between the cases and controls. The way that caIntegrator addresses this need is with a 'control training file' that lists the sample IDs of all the controls. Any sample that is not listed in this file comes from a case. The screenshot below shows a portion of an example training file in CSV format from a Microsoft Excel 2007 window.

[Figure: illustration of step]
A portion of a control training file listing the sample IDs of all the controls from our example data source. You don't need to understand the format or nomenclature of the sample IDs – they were generated by the instrument or technician who ran the samples.
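
The rule described above (a sample listed in the control training file is a control; any other sample is a case) can be sketched as follows. This assumes the file holds one sample ID per line, which the tutorial does not state explicitly, and the sample IDs below are hypothetical.

```python
def read_control_ids(file_text):
    """Collect the non-empty lines of a control file as a set of sample IDs."""
    return {line.strip() for line in file_text.splitlines() if line.strip()}


def label_samples(sample_ids, control_ids):
    """Label each sample 'control' if it is listed in the control file, else 'case'."""
    return {
        sid: ("control" if sid in control_ids else "case")
        for sid in sample_ids
    }


# Hypothetical control file content and sample list.
control_file_text = "S-001\nS-004\n"
all_samples = ["S-001", "S-002", "S-003", "S-004"]

labels = label_samples(all_samples, read_control_ids(control_file_text))
print(labels)
```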

To add your control training CSV file to the study, click on the 'Choose File' button next to the 'Control Samples File' label.


[Figure: illustration of step]
The filename of the mapping file we just uploaded now appears next to the 'Choose File' button for 'Subject to Sample Mapping File' (highlighted in red). Now click on the 'Choose File' button next to 'Control Samples File' (highlighted in blue) to begin uploading your control training file.

In the Open dialog that follows, find your control training file, click on it, and then click on the 'Open' button. (In our example, the control training file is named 'control_training_file_tutorial.CSV'.)

[Figure: illustration of step]
Click on the 'control_training_file_tutorial.CSV' file (highlighted in red), then click on the 'Open' button (highlighted in blue).

  1. Back on the 'Edit Sample Mappings' page, the filename of the control training file you just opened is now displayed to the right of the 'Choose File' button from step 26. Now enter a name for the control sample set in the 'Control Sample Set Name' text field (our example uses 'tutorial controls'), then click on the 'Map Samples' button to map your samples.

[Figure: illustration of step]
The filename of the control training file you just uploaded now appears to the right of the 'Choose File' button (highlighted in red). Enter a title into the 'Control Sample Set Name' text field (highlighted in blue), then click on the 'Map Samples' button (highlighted in green) to map your samples.

  1. Back on the 'Edit Study' page, the new mapping and control files we uploaded are now listed under the File Description column, while the Status has changed from 'Not mapped' to 'Ready to be loaded'. We are now done mapping our samples and are ready to query them.

[Figure: illustration of step]
The mapping file we uploaded now appears under the File Description column and is highlighted in red, while the control file we uploaded is highlighted in green. Under the Status column, the status has changed from 'Not mapped' to 'Ready to be loaded' (highlighted in blue).

  1. To see what obstacles may arise in the course of loading mapping data, let's try another file. This one, named 'duplicate_mapping_file_tutorial.CSV', will replace the one we loaded in steps 26 to 28. A partial screenshot of this file, taken from a Microsoft Excel 2007 window, is shown below.

[Figure: illustration of step]
In this mapping file, the same sample (ID 191) is mapped twice, once to subject ID 5085 (highlighted in red) and again to subject ID 6000 (highlighted in blue).

...

Surprisingly, when we repeat the procedure for loading mappings with the 'duplicate_mapping_file_tutorial.CSV', caIntegrator does not display any error message, and its source's status shows as 'Ready to be loaded' in the 'Genomic Data Sources' table, as was the case with the previous mapping file we loaded successfully. Does this mean that caIntegrator allows multiple mappings of the same sample to different subjects?

[Figure: illustration of step]
When loading an invalid mapping file, caIntegrator does not display any error messages and shows the status of the invalidly mapped source as 'Ready to be loaded' (highlighted in red).

  1. As it turns out, when caIntegrator parses a mapping file in which the same sample is mapped to multiple subjects and encounters a sample ID that has already been mapped, it will overwrite the old mapping with the new one. We can confirm this by clicking on the 'Map Samples' button for the source we mapped and examining the 'Samples Mapped to Subjects' table on the 'Edit Sample Mappings' page.

[Figure: illustration of step]
[Figure: illustration of step]
On the 'Edit Sample Mappings' page, sample ID 191 is only mapped to a single subject (highlighted in red), even though the mapping file we just loaded mapped that same sample twice.

As you can see, the mapping table shows only one mapping for sample ID 191, even though this sample was mapped to two different subjects in the new mapping file we just loaded. The subject ID it's mapped to is 6000 (the second one in the mapping file), not 5085 (the first one in the mapping file). This means that caIntegrator overwrote the first mapping of sample ID 191 with the second one.

We've learned a valuable lesson from this exercise: be sure to check your mapping file for any duplicates before loading it into your study, as caIntegrator does not perform this check for you!
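
A check along these lines can be scripted. The sketch below assumes the two-column, header-less layout described earlier (subject ID, then sample ID) and reports every sample ID that is mapped to more than one subject. The first two rows mirror the example above; the third row is a hypothetical addition.

```python
import csv
import io
from collections import defaultdict


def find_conflicting_samples(mapping_csv_text):
    """Return {sample ID: [subject IDs]} for samples mapped to several subjects."""
    subjects_by_sample = defaultdict(list)
    for row in csv.reader(io.StringIO(mapping_csv_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        subject_id, sample_id = row[0].strip(), row[1].strip()
        if subject_id not in subjects_by_sample[sample_id]:
            subjects_by_sample[sample_id].append(subject_id)
    return {
        sample: subjects
        for sample, subjects in subjects_by_sample.items()
        if len(subjects) > 1
    }


mapping = "5085,191\n6000,191\n6001,205\n"
conflicts = find_conflicting_samples(mapping)
print(conflicts)
```

An empty result means no sample ID is mapped to more than one subject.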

...

  1. On the 'Edit Study' page, click on the 'My Studies' drop-down list in the blue banner at the top, then click on 'Demo Study for ICR Folks'.

[Figure: illustration of step]
Click on the 'My Studies' drop-down list (highlighted in red), then click on 'Demo Study for ICR Folks' (highlighted in blue).

  1. On the 'Welcome' page, click on the 'Search Demo Study for ICR Folks' link under the 'DEMO STUDY FOR ICR FOLKS' heading in the navigation panel at the left.

[Figure: illustration of step]
Click on 'Search Demo Study for ICR Folks' (highlighted in red) to begin querying the study.

  1. On the 'Search' page, click on the drop-down list under the 'Define Query Criteria' heading. The list shows the different criteria we can query the study by. Since we want to query genomic data, click on 'Gene Expression', then click on the 'Add' button to the right of the list.

[Figure: illustration of step]
Click on the 'Define Query Criteria' drop-down list (highlighted in red), then click on 'Gene Expression' (highlighted in blue) and click on the 'Add' button (highlighted in green).

  1. When querying by gene expression, you can search either by gene symbol or by fold change. In this example, we'll search by gene symbol. Click on the 'Gene Name' drop-down list, then click on the 'Gene Name' list entry.

[Figure: illustration of step]
Click on the 'Gene Name' drop-down list (highlighted in red), then click on the 'Gene Name' list entry (highlighted in blue).

  1. In the gene symbol text field that appears to the right, type in 'EGFR' (the symbol for the epidermal growth factor receptor gene), then click on the 'Run Query' button below.

[Figure: illustration of step]
Type 'EGFR' into the 'Gene Symbol' text field (highlighted in red), then click on the 'Run Query' button (highlighted in blue).

...

You can sort these results in numerical order of subject ID by clicking on the 'Subject ID' heading above the right table column.

[Figure: illustration of step]
Click on the Subject ID column heading (highlighted in red) to sort the EGFR gene query results.

  1. As it stands, these query results are not very useful, as they only show which subjects have EGFR expression data and don't show the actual data itself. To change this, click on the 'Results Type' tab at the top of the page, then click on the 'Gene Expression' radio button under the 'Select Results Type' heading. This will change the query results to display one or more numerical values which indicate the expression levels of the EGFR gene for each sample.

[Figure: illustration of step]
Click on the 'Results Type' tab (highlighted in red), then click on the 'Gene Expression' button (highlighted in blue).

  1. In the query results, we can choose to display every EGFR expression value for a given sample, or to display a single value which represents the median of that sample's values. For simplicity's sake, let's choose the latter option by clicking on the 'Gene' button next to 'Select Reporter Type', then clicking on the 'Run Query' button to display the results.

[Figure: illustration of step]
Click on the 'Gene' button (highlighted in red) to display a single value representing each subject's EGFR expression levels in the query results, then click on the Run Query button (highlighted in blue) to display the results.
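
To make the 'Gene' option concrete: a sample may have several expression values for the same gene, and the single value displayed is their median. The sketch below illustrates that calculation with Python's standard library; the values are hypothetical, and it does not reproduce caIntegrator's internal code.

```python
from statistics import median


def gene_level_value(expression_values):
    """Collapse one sample's expression values for a gene into their median."""
    return median(expression_values)


# Hypothetical EGFR expression values for one sample.
egfr_values = [7.2, 8.1, 7.9]
print(gene_level_value(egfr_values))
```

With an even number of values, `statistics.median` returns the mean of the two middle values.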

  1. Back on the 'Query Results' page, there are now two additional columns of data: Sample ID and EGFR. The value in the EGFR column represents the median of the gene's expression levels for the corresponding subject and sample. Note that the screenshot below only displays the first five results in the list; you can scroll down the list via the bar at the right to view the rest of the results.

[Figure: illustration of step]
The query results now show two additional columns: Sample ID and EGFR. The latter represents median EGFR expression values. Click on the 'Save query as…' tab (highlighted in red) to save these results for future reference.

To save this query in caIntegrator for future reference, click on the 'Save query as..' tab at the top of the page, enter a name and description for the query in the respective fields, and click on the 'Save Query' button at the bottom.

[Figure: illustration of step]
Enter a query name and query description in the respective text fields, then click on the 'Save Query' button (highlighted in red) to save the query for future reference.

  1. Once the query is saved, the Search page will reload and the Study Data menu in the left navigation panel will expand to show the newly saved 'Genomic Query' under the 'My Queries' heading. You can click on the magnifying glass icon to the left of the Query link to bring up the query results again, or on the pencil icon to edit the query criteria.

[Figure: illustration of step]
The newly saved 'Genomic Query' (highlighted in red) is shown in the 'Study Data' menu under 'My Queries'.

...

Please leave your comment in the caIntegrator End User Forum.

...