NIH | National Cancer Institute | NCI Wiki  


Problem: When uploading data from large microarrays, the size of your data archive may exceed the individual file size limit of 2 GB.

Topic: caArray Usage

Release: caArray 2.0 and above

Date entered: 10/17/2011

Solution

This article presents a workaround which allows you to break down your dataset into smaller, more manageable chunks that can be individually uploaded without violating the 2 GB limit.

Overview

Your experiment dataset consists of an IDF metadata file and its corresponding SDRF metadata file, which, in turn, is associated with one or more raw and derived array data files. In this tutorial, the array files we will use are in the Agilent TXT (raw) and TSV (derived) formats; the file formats for your data may differ.

Depending on the size of your array, the combined size of these files may exceed several gigabytes, even after they are compressed into the ZIP archive format required for uploading to caArray. Since the maximum size of a ZIP file that can be uploaded is 2 GB, any dataset which exceeds this limit must be broken down into smaller chunks, each of which contains a subset of the original data.

The general procedure for breaking down the dataset is as follows:

  1. Divide the array data files into smaller batches, each of which will be no larger than 2 GB following ZIP compression.
  2. Split the original SDRF file into multiple SDRF files, each corresponding to a single batch and referencing only the array data files from that batch.
  3. Create multiple IDF files derived from the original IDF, with each one uniquely referencing one of the SDRF files created in the previous step.
  4. Create a ZIP archive for each batch, containing a single IDF and its associated SDRF and raw and derived array data files.
  5. Upload each ZIP archive individually, then validate and import the files from each.

Prerequisites

This tutorial assumes that you have past experience and basic familiarity with uploading data into caArray. Specifically, it assumes that you have already created an experiment for your data, uploaded the corresponding array design, and associated the experiment with that design. In case you lack a basic background on uploading caArray data, please refer to the official caArray User's Guide on the NCI wiki at https://wiki.nci.nih.gov/x/LBo9Ag.

You must have all your experiment data readily accessible on your computer (i.e., not archived or compressed). The data should preferably be consolidated into a single location (i.e., a folder containing every single IDF, SDRF, raw and derived array data file from the experiment).

You will also need an archive creation utility installed on your computer. In this tutorial, we will use WinZip (www.winzip.com), but any comparable utility with support for the ZIP format will do.

Reference Information

The experiment data used in this tutorial was not generated de novo; it came from an existing experiment whose data is publicly available on the official NCI instance of caArray at https://array.nci.nih.gov/caarray/home.action (note that you can download this data without registering for an account on the site). The experiment, entitled "TCGA Ovarian: Comparative Genome Hybridization Analysis Using the Agilent Human Genome CGH 244A Platform", was conducted at Harvard Medical School in Boston, MA. It can be accessed via the URL https://array.nci.nih.gov/caarray/project/EXP-498 or by searching for the experiment ID 'EXP-498' on the NCI caArray instance.

The array design used was TCGA-Agilent_HG-CGH-244A; the array design files can be downloaded from the experiment in ADF format, as can all the experiment data, including the IDF and SDRF metadata files, the Agilent TXT raw array data files, and the TSV derived array data files.

Getting Started -- Dividing the Array Data Into Batches

The screenshot below shows a portion of the dataset from our sample experiment, including the IDF and SDRF files, as well as some TXT and TSV files.

[Screenshot]

This dataset comprises IDF and SDRF metadata files, as well as the TXT raw array and TSV derived array data files they reference.

The total combined size of all the files in this dataset is a whopping 26.8 GB, which is far too large to be uploaded to caArray at once, even when archived into a single file. Our first step, then, is to break down the dataset into smaller batches, each of which will be no larger than 2 GB following ZIP compression. Since the average ZIP compression ratio of array data is about 2.5:1, a batch smaller than 5 GB before compression should come out to less than 2 GB after compression. Because 2.5:1 is only an average, check the actual size of each archive once it has been created.
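The arithmetic behind the 5 GB figure can be checked with a short Python sketch. The function name `estimate_batches` is hypothetical, and the 2.5:1 ratio is only the average quoted in this article, so the result is an estimate rather than a guarantee.

```python
import math

GB = 1024 ** 3  # bytes per gigabyte (binary convention, an assumption)


def estimate_batches(total_bytes, compression_ratio=2.5, zip_limit_bytes=2 * GB):
    """Return (uncompressed budget per batch, minimum number of batches).

    compression_ratio is the average ratio quoted in the article; real data
    may compress better or worse.
    """
    max_uncompressed = zip_limit_bytes * compression_ratio
    return max_uncompressed, math.ceil(total_bytes / max_uncompressed)


max_batch, n_batches = estimate_batches(26.8 * GB)
print(f"Uncompressed budget per batch: {max_batch / GB:.1f} GB")  # -> 5.0 GB
print(f"Minimum number of batches: {n_batches}")                  # -> 6
```

Using smaller batches than this minimum, as the tutorial does, simply increases the number of archives to upload.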

Before creating the batches, first create a subfolder named 'Batches' in your experiment folder, then create individual subfolders ('batch1', 'batch2', etc.) within that folder for each batch. Now, select multiple TXT and TSV files in your file manager (Windows Explorer in this tutorial), taking care to keep the size of the selection below 5 GB, as shown below:

[Screenshot]

When selecting a subset of your TXT and TSV files in your file manager, make sure the combined size of the selected files is below 5 GB, as anything larger may compress to greater than the 2 GB upload limit caArray imposes for a single ZIP archive.

Note: Even though caArray allows archives as large as 2 GB to be uploaded, in this tutorial we will keep the size of archives to approximately 1 GB each to facilitate rapid uploads on slow network connections.

You can now move the file selection to the 'batch1' subfolder we created earlier, as shown below:

[Screenshot: selected batch of array data files being moved into its respective subfolder]

Move the selected files to the subfolder you created for this batch.

You can repeat this procedure to create the remaining batches, as summarized below, until every single file in the dataset has been accounted for:

  1. Create a separate subfolder for each new batch
  2. Select multiple data files in your file manager, taking care to keep selection size below 5 GB uncompressed (2 GB compressed)
  3. Move selected files to respective batch folder
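If you would rather not select files by hand, the three steps above can be sketched in Python. This is a minimal sketch under stated assumptions: the function names `plan_batches` and `move_into_batches` are hypothetical, the data files are assumed to sit directly in the experiment folder with `.txt` and `.tsv` extensions, and files are taken in filename order. The sketch does not guarantee that a raw file and its derived counterpart land in the same batch, so check the result against your SDRF before continuing.

```python
import shutil
from pathlib import Path

GB = 1024 ** 3


def plan_batches(sizes, limit_bytes=5 * GB):
    """Group (item, size) pairs, in order, into batches whose total size
    stays below limit_bytes. A single item at or above the limit still gets
    a batch of its own."""
    batches, current, current_size = [], [], 0
    for item, size in sizes:
        if current and current_size + size >= limit_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += size
    if current:
        batches.append(current)
    return batches


def move_into_batches(experiment_dir, limit_bytes=5 * GB):
    """Move TXT and TSV files into Batches/batch1, Batches/batch2, ..."""
    experiment_dir = Path(experiment_dir)
    data_files = sorted(
        p for p in experiment_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".txt", ".tsv"}
    )
    batches = plan_batches([(p, p.stat().st_size) for p in data_files], limit_bytes)
    for number, batch in enumerate(batches, start=1):
        target = experiment_dir / "Batches" / f"batch{number}"
        target.mkdir(parents=True, exist_ok=True)
        for path in batch:
            shutil.move(str(path), str(target / path.name))
    return batches
```

Calling `move_into_batches` with the path of your experiment folder creates the 'Batches' subfolder described earlier; pass a smaller `limit_bytes` if you want archives of roughly 1 GB, as in this tutorial.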

Splitting The Original SDRF File

Now that we've created batches of our array data files, our next step is to split the original SDRF file into multiple SDRFs, each corresponding to a single batch and referencing only the array data files from that batch. To do so, first open the original SDRF file in Microsoft Excel or another tab-delimited data viewer, as shown below:

[Screenshot: Microsoft Excel window showing contents of the experiment's SDRF file]

The SDRF file from your experiment lists all the associated raw array data files under the column headed 'Array Data File'.

As you can see, the column headed 'Array Data File' lists the filenames of all the raw array data files from the experiment. The first 40 rows correspond to all the data files from the first batch we created in the previous section, Getting Started. We can generate a unique SDRF file for this batch by deleting all the other rows from the file -- except, of course, for the top header row -- and saving the modified file as a new SDRF with a different filename from the original. (The convention used in this tutorial is to prefix the original SDRF filename with a number representing the batch, followed by a period. For example, if the original SDRF filename is 'hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf', then the filename of the first, or 'zeroth', batch would be '0.hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf'.)

Once you've generated a new SDRF file, copy it over to its respective batch's folder. For example, '0.hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf' would be copied to the 'batch1' folder containing all the array data files from the first batch we created.

You can generalize this procedure to the entire original SDRF file, and thus all the batches from your dataset, by following these steps:

  1. Open the original SDRF file and locate the rows referencing the array files for the next batch
  2. Delete all other rows except for the top header row
  3. Save the modified SDRF as a new file with a filename unique to its respective batch
  4. Copy the newly generated SDRF to its respective batch's folder
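The same row filtering can be sketched in Python instead of Excel. This is a hedged illustration, not part of caArray: the function names `rows_for_batch` and `split_sdrf` are hypothetical, the SDRF is assumed to be tab-delimited UTF-8 text with a single header row, and the batch folders are assumed to be named 'batch1', 'batch2', and so on inside one 'Batches' folder. A row is kept when any of its cells matches the name of a file in the batch folder, which avoids assuming particular column headings.

```python
from pathlib import Path


def rows_for_batch(lines, batch_filenames):
    """Keep the header line plus every line that mentions a file in the batch."""
    header, body = lines[0], lines[1:]
    kept = [
        line for line in body
        if any(cell in batch_filenames for cell in line.rstrip("\r\n").split("\t"))
    ]
    return [header] + kept


def split_sdrf(original_sdrf, batches_dir):
    """Write one SDRF per batch folder, prefixed '0.', '1.', ... as in this tutorial."""
    original_sdrf = Path(original_sdrf)
    lines = original_sdrf.read_text(encoding="utf-8").splitlines(keepends=True)
    # Sort so that batch2 comes before batch10.
    batch_dirs = sorted(
        (p for p in Path(batches_dir).iterdir() if p.is_dir()),
        key=lambda p: (len(p.name), p.name),
    )
    written = []
    for index, batch_dir in enumerate(batch_dirs):
        names = {p.name for p in batch_dir.iterdir() if p.is_file()}
        target = batch_dir / f"{index}.{original_sdrf.name}"
        target.write_text("".join(rows_for_batch(lines, names)), encoding="utf-8")
        written.append(target)
    return written
```

Because each new SDRF is written straight into its batch folder, the separate copy step is not needed when using this sketch. It is still worth opening one of the generated files to confirm that the expected rows were kept.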

Creating a Unique IDF File For Each Batch

Once you've generated a unique SDRF file for each batch, you must also generate a unique IDF file which references that SDRF file. You can do so simply by opening the original IDF file and editing the field 'SDRF Files' with the filename of the SDRF you wish to reference, as shown below:

[Screenshot: Microsoft Excel window showing how to edit the 'SDRF Files' field in the IDF file]

Edit the 'SDRF Files' field in your IDF file to reflect the file name of the new SDRF file you generated previously.

In this example, the originally referenced SDRF filename 'hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf' has been changed to '0.hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf', which is the SDRF for the first batch we created. As with the SDRF files we modified in the previous section, be sure to save the modified IDF file as a new IDF with the same filename as its referenced SDRF, but with the 'IDF' extension instead of 'SDRF'. For example, the file which references '0.hms.harvard.edu_OV.HG-CGH-244A_1.6.0.sdrf' would be named '0.hms.harvard.edu_OV.HG-CGH-244A_1.6.0.idf'. Finally, copy this IDF file over to its respective batch's folder containing the referenced SDRF file and all the associated array data files. You can repeat this procedure for all your batches. In summary:

  1. Open the original IDF file and locate the 'SDRF Files' field.
  2. Edit this field to reflect the file name of the SDRF file you wish to reference.
  3. Save the modified IDF file with a unique filename that is parallel to the referenced SDRF's filename.
  4. Copy the newly generated IDF to its respective batch's folder
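The IDF edit is a one-line change, so it also lends itself to a short Python sketch. The function names `idf_for_sdrf` and `write_batch_idfs` are hypothetical, and the sketch assumes the IDF is tab-delimited UTF-8 text in which the row label is exactly 'SDRF Files', as in the example above. If your IDF labels the row differently, pass that label as `field`.

```python
from pathlib import Path


def idf_for_sdrf(idf_text, sdrf_filename, field="SDRF Files"):
    """Return a copy of the IDF text whose SDRF row references sdrf_filename."""
    out, replaced = [], False
    for line in idf_text.splitlines():
        cells = line.split("\t")
        if cells[0].strip() == field:
            line = f"{cells[0]}\t{sdrf_filename}"
            replaced = True
        out.append(line)
    if not replaced:
        raise ValueError(f"no '{field}' row found in the IDF")
    return "\n".join(out) + "\n"


def write_batch_idfs(original_idf, batches_dir):
    """For every SDRF found in a batch folder, write a matching IDF next to it."""
    idf_text = Path(original_idf).read_text(encoding="utf-8")
    written = []
    for sdrf in sorted(Path(batches_dir).glob("*/*.sdrf")):
        target = sdrf.with_suffix(".idf")  # same name, 'idf' extension
        target.write_text(idf_for_sdrf(idf_text, sdrf.name), encoding="utf-8")
        written.append(target)
    return written
```

Raising an error when the row is missing is a deliberate choice: silently writing an IDF that still points at the original SDRF would only surface later, during validation in caArray.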

Creating the Archives


Now that we've divided our dataset into batches and generated the corresponding IDF and SDRF files for each, our next step is to create a ZIP archive of each batch. Launch WinZip, click the 'New' toolbar button, and enter a name for your archive in the 'New Archive' dialog. We'll call ours 'upload.zip', as shown below.

[Screenshot]

In WinZip's 'New Archive' dialog, specify a filename for the data archive to be created ('upload.zip' in our example).

Once we've created the archive, we can now add files to it. We can refer to our previous notes of all the filenames associated with our IDF file. In our example, the archive will consist of a total of 42 files: one IDF, one SDRF, 20 TXT, and 20 TSV files. We can select these files in the 'Add' dialog as shown below, then click the 'Add' button at the bottom to begin creating the archive. (Hint: Hold down the CTRL key to select multiple files.)


[Screenshot]

In WinZip's 'Add' dialog, select all the related IDF, SDRF, raw data, and derived data files (a total of 42 files in our example), then click the 'Add' button below to begin creating the archive.

Warning: After you've created the archive, ensure that the resulting file size is less than 2 GB. If it isn't, you will either have to re-create the archive with a higher compression ratio, or subdivide the batch into smaller batches. In our example, the size of the 'upload.zip' archive came out to approximately 900 MB, as shown below, so the file is ready to upload as is.


[Screenshot]

In our example, the 'upload.zip' data archive we created is approximately 900 MB in size, which is below the 2 GB upload limit. If your data archive turns out to be larger than 2 GB, you will not be able to upload it until you re-create it with a higher compression ratio.
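Any ZIP-capable tool will do for this step, so as an alternative to WinZip, the archive and the size check can be sketched with Python's standard `zipfile` module. The function name `zip_batch` is hypothetical. The sketch assumes every file for the batch sits directly in one folder, stores the files at the top level of the archive, and uses the highest Deflate level in place of a "higher compression ratio" setting.

```python
import zipfile
from pathlib import Path

LIMIT_BYTES = 2 * 1024 ** 3  # the 2 GB upload limit described in this article


def zip_batch(batch_dir, archive_path, limit_bytes=LIMIT_BYTES):
    """Zip every file in batch_dir and report (archive size, fits under limit)."""
    batch_dir = Path(batch_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(
        archive_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as archive:
        for path in sorted(batch_dir.iterdir()):
            # Skip subfolders, and the archive itself if it lives in batch_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                archive.write(path, arcname=path.name)
    size = archive_path.stat().st_size
    return size, size < limit_bytes
```

If the second return value is `False`, the batch needs to be subdivided, since the sketch already uses the strongest Deflate setting.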

Uploading the Archive


To upload the archive, first log in to caArray and navigate to the experiment you will be uploading your data into, then select the 'Data' tab, followed by the 'Manage Data' tab beneath it. Now click on the 'Upload New Files' button as shown below.


[Screenshot]

Click the 'Upload New Files' button under the 'Manage Data' tab to specify the location of your data archive.

A new pop-up window entitled 'Experiment Data Upload' will appear in your Web browser, prompting you to upload files. Click on the 'Browse' button, then select the 'upload.zip' archive we created previously from the Open dialog as shown below.

[Screenshot]
[Screenshot]

In the 'Experiment Data Upload' pop-up window, click the 'Browse' button, then in the 'File Upload' dialog, navigate to the ZIP data archive we created previously and click on the 'Open' button.

Back in the 'Experiment Data Upload' window, make sure that the box labeled 'Unpack Compressed Archive' is checked, then click on the 'Upload' button to begin uploading the file.


[Screenshot: 'Experiment Data Upload' window showing how to begin uploading the data archive]

Depending on the size of the archive, the performance of your caArray server, and your network bandwidth, it may take anywhere from five to 30 minutes -- and possibly longer -- for the archive to upload. Remember to keep the upload window open throughout the entire upload process, even after the blue progress bar has reached 100%. (For reference, on a caArray server running a quad-core 2.33 GHz Intel(R) Xeon(R) 5148 CPU with 16 GB of memory, the total time required to extract and process a 1.1 GB upload after the progress bar had reached 100% was about 13 minutes and 30 seconds.)


[Screenshot: 'Experiment Data Upload' window showing progress of file upload]

Even when the blue upload progress bar reaches 100%, do not close the 'Experiment Data Upload' window. You will be notified when the upload is complete.

You'll know the upload is complete when you see a new window overlaid on the upload window with the message 'Your file upload is complete', as shown below. Click the 'OK' button below this message, then click on the 'Close Window' button behind it to return to the main experiment window.

[Screenshot]

When the upload is complete, a new window with the message 'Your file upload is complete' appears over the upload window.

Validating the Archive

Back in the main experiment window, the contents of the archive we just uploaded are now listed under the 'Manage Data' tab. The TSV matrix files are considered supplemental, so we will move them to the 'Supplemental Files' tab by first using the 'Filter By File Type' drop-down to show only TSV files, then checking off all the TSV files in the list, and finally clicking on the 'Add Supplemental Files' button below.

[Screenshot: 'Manage Data' tab showing how to mark derived array data files as supplemental]
[Screenshot]

You can mark the derived array data files as supplemental by checking them off under the 'Manage Data' tab, then clicking the 'Add Supplemental Files' button.

These TSV files now appear under the 'Supplemental Files' tab, alongside other TSV files from a previous upload to the same experiment.

[Screenshot]

The derived array data files we checked off under the 'Manage Data' tab now appear under the 'Supplemental Files' tab, alongside other such files from a previous upload to the same experiment.

Back on the 'Manage Data' tab, the remaining files from our upload are one IDF, one SDRF, and 20 TXTs (only the first three of these files are shown below due to space constraints). Note that the status of the TXT file from the screenshot (and of all other TXT files in the list) shows as 'Unknown', which means that caArray did not automatically recognize the file type in this particular case. As a result, we will have to specify the file type manually by first using the 'Filter By File Type' drop-down to show only TXT files, then checking off all the TXT files in the list, and finally clicking the 'Change File Type' button below.


[Screenshot]
[Screenshot]

Since caArray didn't automatically recognize the format of the array data files we uploaded, we must manually specify the format ourselves by selecting the files under the 'Manage Data' tab, then clicking the 'Change File Type' button.

For the particular data in this example, the array data files are in the Agilent Raw TXT format. To specify this, in the 'Manage Files' window shown below, select 'Agilent Raw TXT' from the 'Select New File Type' drop-down list, then click on the 'Save' button above it.

Note: Depending on the assay type and array design used in your own experiment, your data may be in a different format, in which case you will have to select that format from the drop-down list, or the file type may be automatically recognized by caArray, in which case you won't have to specify it manually.


[Screenshot]

Manually specify the format of the uploaded array data files by selecting the appropriate format (Agilent Raw TXT in this example) from the 'Select New File Type' drop-down list.

Back on the 'Manage Data' window, the status of all the TXT files now shows as 'Agilent Raw TXT', indicating that caArray now correctly recognizes the file type.


[Screenshot]

Back on the 'Manage Data' window, the format of all the originally unrecognized array data files now shows under the 'File Type' column (as Agilent Raw TXT in our example), indicating that caArray now correctly recognizes the file type.

Our next step is to validate all the files, which we will do by checking off every single file in the list (IDF, SDRF, and TXT), then clicking the 'Validate' button below.


[Screenshot]

To begin verifying the uploaded data, check off all the array data files under the 'Manage Data' tab, then click the 'Validate' button.


The page will now refresh with the updated status of the selected files showing as 'In Queue'. Depending on the size of the files and the performance of your server, the TXT files may take several minutes to validate, so be patient. Note that the page will not automatically refresh once the files have finished validating, so you will have to refresh it yourself by periodically clicking on the 'Refresh Status' button at the bottom of the window until the file status updates again.


[Screenshot]

The 'Manage Data' tab now refreshes with the status of the array data files showing as 'In Queue'.

You'll know the validation was successful when the status of the files shows as 'Validated' or 'Validated, Not Parsed'.

Note: The 'Not Parsed' status would only show in versions of caArray prior to v2.4.0 which had not yet implemented a parser for the Agilent TXT format and were thus unable to parse these files. Either way, these files can still be imported into your experiment with or without having been parsed beforehand.

Once the files have been validated, you can import them into the study by checking all the files in the list, then clicking on the 'Import' button below.

 

[Screenshot]
[Screenshot]

Once the data finishes validating, the 'Manage Data' tab will appear with the status of the array data files showing as 'Validated' or 'Validated (Not Parsed)', depending on the version of caArray you're running. To import the files, select them all, then click the 'Import' button.

The page will again refresh with the files' status showing as 'Importing'. After a few minutes, click the 'Refresh Status' button until the file status updates again.


[Screenshot]

The 'Manage Data' tab now refreshes with the status of all the selected files showing as 'Importing'.

You'll know the import was successful when the uploaded files no longer appear under the 'Manage Data' tab, with the message 'Nothing Found To Display' in their place, as shown below.

[Screenshot: 'Manage Data' tab showing the message 'Nothing Found To Display']

The files now appear under the 'Imported Data' tab, as shown below, with a status of 'Imported'. Note that other, previously uploaded files from the same experiment appear under this tab as well alongside the files we just imported.

[Screenshot]

The imported files now appear under the 'Imported Data' tab with a status of 'Imported' alongside other files from a previous upload to the same experiment.

Reproducing the Procedure

So far, only one-sixth of the data has been uploaded. You can repeat the procedure we have followed so far for each remaining batch, and apply it to the data from your own experiment. In summary:

  • Create a ZIP archive for each batch which contains the IDF, SDRF, and all the associated array data files (TXT and TSV in our example), ensuring that the size of the archive is less than 2 GB following compression.
  • Upload the ZIP archive to your caArray instance
  • Depending on the format of your raw array data, manually specify the file type for the array data files, as they may not be automatically recognized by caArray
  • Validate the uploaded files
  • Import the validated files into the experiment
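Before repeating the upload for each batch, it can save time to check every archive locally. The following Python sketch is one way to do so under stated assumptions: the function name `check_archive` is hypothetical, metadata files are assumed to end in '.idf' and '.sdrf', and the IDF row label is assumed to be 'SDRF Files' as in this tutorial. It does not replace caArray's own validation; it only catches packaging mistakes before a long upload.

```python
import zipfile
from pathlib import Path


def check_archive(archive_path, limit_bytes=2 * 1024 ** 3, sdrf_field="SDRF Files"):
    """Return a list of problems found in one batch archive (empty if none)."""
    archive_path = Path(archive_path)
    problems = []
    if archive_path.stat().st_size >= limit_bytes:
        problems.append("archive is not below the size limit")
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
        idfs = [n for n in names if n.lower().endswith(".idf")]
        sdrfs = [n for n in names if n.lower().endswith(".sdrf")]
        if len(idfs) != 1:
            problems.append(f"expected 1 IDF, found {len(idfs)}")
        if len(sdrfs) != 1:
            problems.append(f"expected 1 SDRF, found {len(sdrfs)}")
        if len(idfs) == 1 and len(sdrfs) == 1:
            # Collect every non-empty value on the IDF's SDRF row.
            idf_text = archive.read(idfs[0]).decode("utf-8", errors="replace")
            referenced = [
                cell.strip()
                for line in idf_text.splitlines()
                if line.split("\t")[0].strip() == sdrf_field
                for cell in line.split("\t")[1:]
                if cell.strip()
            ]
            if referenced != [sdrfs[0]]:
                problems.append(
                    f"IDF references {referenced}, archive holds {sdrfs[0]}"
                )
    return problems
```

An empty list means the archive is below the size limit and contains exactly one IDF that references the one SDRF packaged with it.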

Have a comment?

Please leave your comment in the caArray End User Forum.
