This document provides a quick overview of the data submission process. For detailed information on the process, see the Data Submission Guide. Contact the DCC for more information or if you have questions.

Prerequisites

Before attempting to submit an archive, you must meet these prerequisites:

  1. If your center has never submitted data to TCGA, contact the DCC at least a week in advance to create your account.
  2. If your center has submitted data, but you have not done so yourself, be sure to determine your center type.
  3. Know your sftp login username and password. NCI requires that a single user at each institution be responsible for maintenance of the username and password, so you may need to contact that person to gain access. Send an email to ncicbmb@mail.nih.gov if you need to reset your password.

Building an Archive

TCGA accepts data through TCGA Archives, which have a very specific structure. While this document does not address all aspects of archives, you should note several key points.

  • Most data submissions require two archives: the data archive and the MAGE-TAB (metadata) archive.
  • You must follow data archive naming standards.
  • The revision number is important for updating archives (see below).
  • The MAGE-TAB archive that corresponds to the data archive contains metadata about the data archive and uses a naming convention similar to that of the data archive.
  • There is only a single MAGE-TAB archive for any given disease and platform combination, regardless of how many data archives are present.

Two important archive components allow verification of file integrity.

  1. The MANIFEST.txt files are part of both the data archive and the MAGE-TAB archive. The MANIFEST.txt file contains a list of all the files contained in that archive and their associated md5sum values. The MANIFEST.txt file is a mandatory part of any archive submitted to the DCC.
  2. All archives must have accompanying md5sum values. For example, the archive my.institute.org.Disease.MyPlatform.Level_1.1.1.0.tar.gz must be accompanied by the file my.institute.org.Disease.MyPlatform.Level_1.1.1.0.tar.gz.md5 containing the md5sum value for the compressed archive. Failure to include the MANIFEST.txt or the archive md5sum files will result in the archive failing validation.
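
The two integrity files described above can be produced with standard tools. The following is a minimal sketch, not the DCC's own procedure: the function names write_manifest and write_archive_md5 are hypothetical, the md5sum utility from GNU coreutils is assumed to be installed, and the "<md5>  <filename>" line layout is simply md5sum's default output, which is assumed here to be acceptable. Check the Data Submission Guide for the authoritative format.

```shell
# Hypothetical helper: write MANIFEST.txt inside an archive directory.
# Each line is md5sum's default output ("<md5>  <filename>"); this layout is
# an assumption, not something this guide specifies.
write_manifest() {
    (
        cd "$1" || exit 1
        # List every regular file except MANIFEST.txt itself, in sorted order.
        find . -type f ! -name MANIFEST.txt | sed 's|^\./||' | sort |
            while IFS= read -r f; do
                md5sum "$f"
            done > MANIFEST.txt
    )
}

# Hypothetical helper: write <archive>.md5 next to a compressed archive.
# Run it from the directory that holds the archive so that the recorded
# file name carries no path prefix.
write_archive_md5() {
    md5sum "$1" > "$1.md5"
}

# Example use (run MANIFEST.txt generation before creating the tar.gz):
#   write_manifest my.institute.org.Disease.MyPlatform.Level_1.1.1.0
#   write_archive_md5 my.institute.org.Disease.MyPlatform.Level_1.1.1.0.tar.gz
```

Because the manifest must be inside the archive, it has to be written before the archive is compressed, while the .md5 file can only be written afterwards.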

Once you have created your data and MAGE-TAB archives (and associated md5sum values), be sure to check them for errors with the most current version of the locally run TCGA Client Side Validator before submitting them.

The Mac tar command

The native tar command on Macintosh computers produces invalid archives. Users who are creating archives on a Mac must use the gnutar command instead.

Beware Hidden Files

If there are hidden files in your archives (either the UNIX style .filename or a Windows file hidden by attribute), the archive will fail validation. Be sure to check your data directories for hidden files before you create your archives! For those using OS X, note that the tar command will introduce hidden files into your archive unless the COPYFILE_DISABLE environment variable is set:

COPYFILE_DISABLE=1 gnutar czvf my.institute.org.Disease.MyPlatform.Level_1.1.1.0.tar.gz my.institute.org.Disease.MyPlatform.Level_1.1.1.0
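
A quick way to look for UNIX-style hidden files, both before and after creating the archive, is sketched below. The function names list_hidden_files and list_hidden_in_archive are hypothetical, and the sketch only detects names that begin with a dot; files hidden by a Windows attribute are not visible to these commands and must be checked on the Windows side.

```shell
# Hypothetical helper: print dot-files and dot-directories found anywhere
# below an archive directory. No output means none were found.
list_hidden_files() {
    find "$1" -mindepth 1 -name '.*'
}

# Hypothetical helper: print entries inside a .tar.gz whose name, or any
# directory in their path, starts with a dot (for example ._file or .DS_Store).
# A leading "./" path prefix is not treated as hidden.
list_hidden_in_archive() {
    tar -tzf "$1" | grep -E '(^|/)\.[^/]'
}

# Example use:
#   list_hidden_files my.institute.org.Disease.MyPlatform.Level_1.1.1.0
#   list_hidden_in_archive my.institute.org.Disease.MyPlatform.Level_1.1.1.0.tar.gz
```

Checking the finished archive as well as the source directory is worthwhile because, as noted above, the hidden files can be introduced by tar itself.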

Submitting an Archive

Once you have checked your data archive and your MAGE-TAB archive, upload both, using sftp, to username@tcgaftps.nci.nih.gov. All data deposited in your root directory is automatically processed by QCLive. No other directory (including the 'other' directory) is visited by QCLive.
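
The upload can be typed interactively at the sftp prompt, or the put commands can be generated and piped in. The sketch below assumes an OpenSSH-style sftp client, which reads commands from standard input; the function name sftp_put_commands is hypothetical, and it assumes each archive's .md5 file sits beside the archive and is uploaded along with it.

```shell
# Hypothetical helper: print one sftp "put" command for each archive and one
# for its .md5 companion. The function only prints text; it uploads nothing.
sftp_put_commands() {
    for archive in "$@"; do
        printf 'put "%s"\nput "%s.md5"\n' "$archive" "$archive"
    done
}

# Example use (replace username with your center's sftp login). With no
# remote path given, put writes to the directory you land in after login:
#   sftp_put_commands data-archive.tar.gz mage-tab-archive.tar.gz |
#       sftp username@tcgaftps.nci.nih.gov
```

Printing the commands first makes it easy to review exactly what will be sent before anything reaches the directory that QCLive processes.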

Revising an Existing Archive

An existing archive can be revised by adding or removing data. The following sections discuss the requirements for performing either of these tasks.

Adding Data

When adding data to an existing archive, you do not need to upload previously submitted data. Instead, create a data archive that contains only the new data and use the same archive name as before, but increment the revision number. For example, to update deployed archive my.institute.org_Disease.MyPlatform.Level_1.1.1.0, create and submit the new archive my.institute.org_Disease.MyPlatform.Level_1.1.2.0.

Be very sure that the MANIFEST.txt file in the data archive and the SDRF file in the MAGE-TAB archive contain references to all the data files, both old and new, that comprise the complete data set. Keep in mind that this requires you to maintain a list of the old files’ md5 checksums; these must match the checksums currently at the DCC.

When QCLive runs, it looks both in the newly uploaded files and in the latest deployed archive for the files listed in MANIFEST.txt. Older archives are not examined for data files. In essence, the MANIFEST.txt and SDRF files define what is considered a complete set of data. Note that by leaving older files out of the MANIFEST.txt and SDRF, you are in effect removing those data.
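
Both bookkeeping steps in this section, bumping the revision number and confirming that the old files' checksums are unchanged, can be sketched in a few lines. The function names next_revision and changed_checksums are hypothetical. The sketch assumes the archive name ends in three dot-separated numbers with the revision in the middle, as in the example above, and that each manifest line has the form "<md5>  <filename>" with no spaces in file names.

```shell
# Hypothetical helper: print the archive name with its revision number
# (the second-to-last dot-separated field) incremented by one.
next_revision() {
    name=$1
    series=${name##*.}       # last field, left unchanged
    rest=${name%.*}
    revision=${rest##*.}     # field to increment
    prefix=${rest%.*}
    case $revision in
        ''|*[!0-9]*)
            echo "cannot find a numeric revision in: $name" >&2
            return 1
            ;;
    esac
    printf '%s.%s.%s\n' "$prefix" "$((revision + 1))" "$series"
}

# Hypothetical helper: print files that appear in both manifests but with
# different checksums. Arguments: previous MANIFEST.txt, new MANIFEST.txt.
changed_checksums() {
    awk 'FILENAME == ARGV[1] { old[$2] = $1; next }
         ($2 in old) && old[$2] != $1 { print $2 }' "$1" "$2"
}

# Example use:
#   next_revision my.institute.org_Disease.MyPlatform.Level_1.1.1.0
#   changed_checksums previous/MANIFEST.txt new/MANIFEST.txt
```

Any file name printed by changed_checksums is listed as an old file but no longer matches the checksum recorded in the previous manifest, so it should be investigated before submitting.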

Removing Data

Technically, data are never removed from the TCGA site, but by changing the files listed in the MANIFEST.txt and SDRF files you can change what is considered current. To remove files no longer considered current from the data archive, simply delete those files from the MANIFEST.txt and the SDRF and submit new data and MAGE-TAB archives with incremented revision numbers.
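
Since dropping a line from MANIFEST.txt retires the file, it can help to list exactly which files a new manifest leaves out before submitting. This is a minimal sketch with a hypothetical function name, removed_from_manifest, and it assumes manifest lines of the form "<md5>  <filename>" with no spaces in file names. It does not inspect the SDRF, which must be edited to match.

```shell
# Hypothetical helper: print files listed in the previous manifest that are
# absent from the new one. Arguments: previous MANIFEST.txt, new MANIFEST.txt.
removed_from_manifest() {
    # awk reads the new manifest first (ARGV[1]) and remembers its file
    # names, then prints each name from the previous manifest not among them.
    awk 'FILENAME == ARGV[1] { keep[$2] = 1; next }
         !($2 in keep) { print $2 }' "$2" "$1"
}

# Example use:
#   removed_from_manifest previous/MANIFEST.txt new/MANIFEST.txt
```

If the output contains a file you meant to keep, add it back to both the MANIFEST.txt and the SDRF before uploading.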

When Good Submissions Go Bad

Despite best intentions, errors do creep into submissions. If you have a submission that fails validation, read through the returned errors; they almost always point to the problem. If you do not understand the error, or the error is one that you cannot correct, email the DCC at tcga-dcc-binf-l@list.nih.gov.
