This document provides a quick overview of the data submission process. For detailed information on the process, see the Data Submission Guide. Contact the DCC for more information or if you have questions.
Before attempting to submit an archive, you must meet these prerequisites:
- If your center has never submitted data to TCGA, contact the DCC at least a week in advance to create your account.
- If your center has submitted data before but you have not, be sure to determine your center type.
- Know your sftp login username and password. NCI requires that a single user at each institution be responsible for maintenance of the username and password, so you may need to contact that person to gain access. Send an email to email@example.com if you need to reset your password.
Building an Archive
TCGA accepts data through TCGA Archives, which have a very specific structure. This document does not address all aspects of archives, but you should note several key points.
- Most data submissions require two archives: the data archive and the MAGE-TAB (metadata) archive.
- You must follow data archive naming standards.
- The revision number is important for updating archives (see below).
- The MAGE-TAB archive that corresponds to the data archive contains metadata about the data archive and uses a naming convention similar to that of the data archive.
- There is only a single MAGE-TAB archive for any given disease and platform combination, regardless of how many data archives are present.
Two important archive components allow verification of file integrity.
- MANIFEST.txt files are part of both the data archive and the MAGE-TAB archive. The MANIFEST.txt file lists all the files contained in that archive and their associated md5sum values. The MANIFEST.txt file is a mandatory part of any archive submitted to the DCC.
- All archives must have accompanying md5sum values. For example, the archive my.institute.org.Disease.MyPlatform.Level_22.214.171.124.tar.gz must be accompanied by the file my.institute.org.Disease.MyPlatform.Level_126.96.36.199.tar.gz.md5 containing the md5sum value for the compressed archive. Failure to include the MANIFEST.txt or the archive md5sum file will result in the archive failing validation.
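The two integrity files described above can be generated with a short script. The following is a minimal sketch, not the DCC's own tooling. It assumes that each line of MANIFEST.txt has the md5sum-style form "<md5>  <filename>", that MANIFEST.txt does not list itself, and that the .md5 file uses the same line form; the function names are hypothetical. Confirm the exact formats against the Data Submission Guide and the Client Side Validator.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir):
    """Write MANIFEST.txt for every regular file directly inside archive_dir.

    Assumption: one "<md5>  <filename>" line per file (md5sum style),
    with MANIFEST.txt itself left out of the listing.
    """
    archive_dir = Path(archive_dir)
    lines = []
    for path in sorted(archive_dir.iterdir()):
        if path.is_file() and path.name != "MANIFEST.txt":
            lines.append(f"{md5_of_file(path)}  {path.name}")
    manifest = archive_dir / "MANIFEST.txt"
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def write_archive_md5(archive_path):
    """Write "<archive name>.md5" next to a compressed archive.

    Assumption: the file holds one "<md5>  <archive name>" line.
    """
    archive_path = Path(archive_path)
    md5_path = archive_path.with_name(archive_path.name + ".md5")
    md5_path.write_text(f"{md5_of_file(archive_path)}  {archive_path.name}\n")
    return md5_path
```

Run write_manifest on the archive directory before compressing it, then run write_archive_md5 on the resulting .tar.gz file.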
Once you have created your data and MAGE-TAB archives (and associated md5sum values), be sure to check them with the most current version of the (locally-run) TCGA Client Side Validator for errors before submitting them.
The Mac tar command
The native tar command on Macintosh computers produces invalid archives. Users who are creating archives on a Mac must use the gnutar command instead.
Beware Hidden Files
If there are hidden files in your archives (either the UNIX style .filename or a Windows file hidden by attribute), the archive will fail validation. Be sure to check your data directories for hidden files before you create your archives! For those using OS X, note that the tar command will introduce hidden files into your archive unless the COPYFILE_DISABLE environment variable is set.
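One way to guard against hidden files is to scan the data directory before archiving, run tar with COPYFILE_DISABLE set, and then inspect the finished archive. The sketch below is illustrative only: the function names are hypothetical, the value "1" for COPYFILE_DISABLE is an assumption (the text above only requires that the variable be set), and only UNIX-style dot files are detected, not files hidden by a Windows attribute.

```python
import os
import subprocess
import tarfile


def find_hidden_files(data_dir):
    """Return sorted paths under data_dir whose name starts with a dot."""
    hidden = []
    for root, dirs, files in os.walk(data_dir):
        for name in dirs + files:
            if name.startswith("."):
                hidden.append(os.path.join(root, name))
    return sorted(hidden)


def create_archive(data_dir, archive_path, tar_command="tar"):
    """Create a gzip-compressed tar archive of data_dir.

    COPYFILE_DISABLE is set in the child environment so that tar on OS X
    does not add hidden files. Pass tar_command="gnutar" on a Mac.
    """
    env = dict(os.environ, COPYFILE_DISABLE="1")
    parent, name = os.path.split(os.path.abspath(data_dir))
    subprocess.run(
        [tar_command, "-czf", str(archive_path), "-C", parent, name],
        check=True,
        env=env,
    )


def hidden_members(archive_path):
    """Return the names of archive members whose base name starts with a dot."""
    with tarfile.open(archive_path) as tar:
        names = tar.getnames()
    return [
        name for name in names
        if os.path.basename(name).startswith(".")
        and os.path.basename(name) != "."
    ]
```

An empty list from both find_hidden_files and hidden_members suggests the archive is free of dot files; the Client Side Validator remains the authoritative check.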
Submitting an Archive
Once you have checked your data archive and your MAGE-TAB archive, you will upload both, using sftp, to firstname.lastname@example.org. All data deposited in your root directory is automatically processed by QCLive. No other directory (including the 'other' directory) is visited by QCLive.
Revising an Existing Archive
An existing archive can be revised by adding or removing data. The following sections discuss the requirements for performing either of these tasks.
When adding data to an existing archive, you do not need to upload previously submitted data. Instead, create a data archive that contains only the new data and use the same archive name as before, but increment the revision number. For example, to update the deployed archive my.institute.org_Disease.MyPlatform.Level_188.8.131.52, create and submit the new archive my.institute.org_Disease.MyPlatform.Level_184.108.40.206. Be very sure that the MANIFEST.txt file in the data archive and the SDRF file in the MAGE-TAB archive contain references to all the data files, both old and new, that comprise the complete data set. Keep in mind that this requires you to maintain a list of the old files' md5 checksums; these must match the checksums currently at the DCC. When QCLive runs, it looks both in the newly uploaded files and in the latest deployed archive for the files listed in MANIFEST.txt. Older archives are not examined for data files. In essence, the MANIFEST.txt and SDRF files define what is considered a complete set of data. Note that by leaving older files out of the MANIFEST.txt and SDRF, you are in effect removing those data.
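Incrementing the revision number can be scripted to avoid typing mistakes. The sketch below assumes that an archive name ends in three dot-separated numeric fields, "<serial index>.<revision>.<series>", and that the revision is the middle one. That layout is an assumption not stated in this document, so check it against the archive naming standards before relying on it. The function name and the example archive name are hypothetical.

```python
def bump_revision(archive_name):
    """Return archive_name with its revision field incremented by one.

    Assumption: the name ends in ".<serial index>.<revision>.<series>",
    all numeric, with no file extension attached.
    """
    head, serial, revision, series = archive_name.rsplit(".", 3)
    if not (serial.isdigit() and revision.isdigit() and series.isdigit()):
        raise ValueError(f"unexpected archive name: {archive_name!r}")
    return f"{head}.{serial}.{int(revision) + 1}.{series}"


# Hypothetical archive name, used only to illustrate the call.
print(bump_revision("my.institute.org_Disease.MyPlatform.Level_1.3.0.0"))
```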
Technically, data are never removed from the TCGA site, but by changing the files listed in the MANIFEST.txt and SDRF files you can change what is considered current. To remove files no longer considered current from the data archive, simply delete those files from the MANIFEST.txt and the SDRF and submit new data and MAGE-TAB archives with incremented revision numbers.
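Because MANIFEST.txt defines the complete data set, a revision's manifest can be derived from the previous one by carrying over the old checksums, adding entries for new files, and dropping entries for files that are no longer current. The following is a minimal sketch under the assumption that manifest lines have the md5sum-style form "<md5>  <filename>". The function names are hypothetical, and the SDRF must still be updated separately to match.

```python
def parse_manifest(text):
    """Parse manifest text into a {filename: md5} dictionary."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            checksum, name = line.split(None, 1)
            entries[name.strip()] = checksum
    return entries


def build_revision_manifest(previous_manifest_text, new_checksums, removed=()):
    """Return manifest text for a revised archive.

    previous_manifest_text: MANIFEST.txt of the latest deployed archive;
        its checksums are reused unchanged so they match those at the DCC.
    new_checksums: {filename: md5} for files in the newly uploaded archive.
    removed: file names that should no longer be considered current.
    """
    entries = parse_manifest(previous_manifest_text)
    for name in removed:
        entries.pop(name, None)
    entries.update(new_checksums)
    return "".join(f"{md5}  {name}\n" for name, md5 in sorted(entries.items()))
```

For example, calling build_revision_manifest(old_text, {"new.txt": new_md5}, removed=["old2.txt"]) keeps every old entry except old2.txt and adds new.txt.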
When Good Submissions Go Bad
Despite best intentions, errors do creep into submissions. If you have a submission that fails validation, read through the returned errors; they almost always point to the problem. If you do not understand the error, or the error is one that you cannot correct, email the DCC at email@example.com.