
A data archive is a compressed directory of experimental result files for a set of samples. An individual archive comes from a particular data submission center, for a specific platform and disease study. An experiment can be represented across multiple archives.

Data Archive Background

Each center transfers its data to the DCC in digital compressed file directories known as data archives. All archives from the same center type include common documents and follow distinct file structure and naming conventions. A TCGA experiment is likely to be represented across many archives.

The following table summarizes concepts particular to TCGA archives:

Concept      Description
Structure    Archives are flat; they do not contain subdirectories.
Name         Archive names follow a specific format.
Content      Archive contents are well defined and include informational text files (such as the manifest file) and data files.
Compression  Archives are compressed before transfer.
Integrity    Each archive is accompanied by a corresponding MD5 file used to verify its integrity.

Archive Naming Convention

Archives are named using a specific convention. The delimiters, underscore and period, are explicit and intentional. The first underscore in the name separates the domain from the rest of the archive name, while periods separate the parts of the center's domain and the parts of the rest of the archive name. Archive names are case-sensitive (as are the names of the files they contain).

Archive Name Format

<domain>_<disease study>.<platform>.<archive type>.<serial index>.<revision>.<series>

Label

Description

domain

The domain for a TCGA center is the Internet domain name associated with the submitting center's institution. Even if there is involvement from other centers, the domain reflects only the submitting center.

For example, broad.mit.edu is the domain for the Broad Institute at MIT, and mskcc.org is the domain for Memorial Sloan-Kettering Cancer Center.

disease study

A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the project, a disease is referred to by its abbreviation. For example, Glioblastoma multiforme is represented by the abbreviation GBM.

A complete list of disease studies and their abbreviations is found in the Code Tables Report.

platform

A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing, which a GSC or GCC may have customized. It is represented by a platform code.

For a complete list of platform codes, see the column "Platform Alias" in the platforms code report.

archive type

The archive type is the classification of a TCGA archive. For a Data Level Archive, this value is either 'Level_1', 'Level_2' or 'Level_3'. For a MAGE-TAB Archive, this value is 'mage-tab'. For an Auxiliary Archive, this value is 'aux'.

serial index

Archives corresponding to the same <domain>_<disease study>.<platform> will have one and only one corresponding mage-tab archive. Conventionally, the serial index of the mage-tab archive is 1; however, the serial index is chosen by the submitting center.

For other archive types, the serial index is a number that uniquely identifies an independent data set from a particular experiment. There is no overlap of data files between archives with different serial indexes. The numbering is entirely up to the data submission center. In general, BCRs use a serial index equivalent to a batch number, while other center types start their serial index series from 1.

revision number

A revision number can indicate the number of times an archive has been revised (starting from 0) and submitted to the DCC. However, the only requirement for revision numbers is that the revision number of the new archive is to be higher than that of the archive being replaced. Files that have been changed or added are captured in the changes and additions files, respectively.

series number

This feature is currently disabled; the series number should always be 0.
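
The naming convention above can be sketched as a small parser. This is only an illustration: the names ArchiveName, parse_archive_name and is_newer_revision are hypothetical, the example archive name is assembled from values mentioned on this page rather than taken from a real submission, and the sketch assumes that the platform code may contain underscores or periods but that the last four fields never contain periods. Treating two archives as "the same archive" when every field except the revision matches is likewise an assumption.

```python
from dataclasses import dataclass, replace

# Archive types listed in the table above.
ARCHIVE_TYPES = {"Level_1", "Level_2", "Level_3", "mage-tab", "aux"}


@dataclass(frozen=True)
class ArchiveName:
    domain: str
    disease_study: str
    platform: str
    archive_type: str
    serial_index: int
    revision: int
    series: int


def parse_archive_name(name: str) -> ArchiveName:
    """Split an archive name into its labelled parts (names are case-sensitive)."""
    # Assumption: trailing .md5 and .tar.gz extensions are not part of the name.
    for suffix in (".md5", ".tar.gz"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]

    # The first underscore separates the domain from the rest of the name.
    domain, sep, rest = name.partition("_")
    if not sep:
        raise ValueError(f"no underscore after domain: {name!r}")

    # Periods separate the remaining parts. The last four fields are split off
    # from the right, so underscores in the platform or archive type survive.
    try:
        disease_study, remainder = rest.split(".", 1)
        platform, archive_type, serial, revision, series = remainder.rsplit(".", 4)
    except ValueError:
        raise ValueError(f"unexpected number of fields: {name!r}") from None

    if archive_type not in ARCHIVE_TYPES:
        raise ValueError(f"unknown archive type: {archive_type!r}")

    return ArchiveName(domain, disease_study, platform, archive_type,
                       int(serial), int(revision), int(series))


def is_newer_revision(new: ArchiveName, old: ArchiveName) -> bool:
    """True if `new` matches `old` in every field except a higher revision."""
    return new.revision > old.revision and replace(new, revision=old.revision) == old


# Hypothetical example name built from values mentioned on this page.
parsed = parse_archive_name("broad.mit.edu_GBM.HT_HG-U133A.Level_1.1.0.0.tar.gz")
print(parsed.domain, parsed.platform, parsed.archive_type)
```

A name that omits one of the fields, or that carries an archive type not listed in the table, raises ValueError rather than being guessed at.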

File Types in an Archive

There are two types of files in an archive transferred to the DCC:

  • Archive description files:

    An archive description file is an ASCII text file that contains information about the other files within the same archive. Its name reflects its function and is always in uppercase. The following archive description files can be found in TCGA archives: MANIFEST.txt, DESCRIPTION.txt, README_DCC.txt and CHANGES_DCC.txt.
  • Center-specific files:

    A center-specific file contains the data associated with an assay or experiment across various data levels.
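
As a rough illustration of the two file types, the sketch below separates a flat list of file names into archive description files and center-specific files. The function name split_archive_files and the sample data file names are hypothetical; the set of description file names is the one listed above, and matching is case-sensitive, in line with the naming convention.

```python
# Archive description files named on this page.
DESCRIPTION_FILES = {
    "MANIFEST.txt",
    "DESCRIPTION.txt",
    "README_DCC.txt",
    "CHANGES_DCC.txt",
}


def split_archive_files(filenames):
    """Return (description_files, center_specific_files), preserving order."""
    description, center_specific = [], []
    for name in filenames:
        # Exact, case-sensitive comparison against the known description files.
        if name in DESCRIPTION_FILES:
            description.append(name)
        else:
            center_specific.append(name)
    return description, center_specific


# Hypothetical archive listing; the data file names are made up.
listing = ["MANIFEST.txt", "sample_a.CEL", "DESCRIPTION.txt", "sample_b.CEL"]
description, center_specific = split_archive_files(listing)
print(description)
print(center_specific)
```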

Accessing Archives

Archives submitted to the DCC are validated against standardized formats before being distributed as both downloadable files and application-accessible data. The research community can search and access the data through the following methods and applications (all of which are accessible via the Data Portal):

Except for HTTP data access, all of them can search and filter archives by parameters such as cancer type, center, platform, data type and submission date.

Ensuring Data Integrity

As a best practice, users should verify that each file downloaded from TCGA was not corrupted during transfer. This is especially important for very large archives. MD5 hash files are provided so that the integrity of archived files can be confirmed easily. A TCGA archive is always accompanied by a corresponding MD5 hash file. The archive and MD5 hash file names differ only in their extensions, as in the following example.

Archive   broad.mit.edu_GBM.HT_HG-U133A.1.0.0.tar.gz
MD5 file  broad.mit.edu_GBM.HT_HG-U133A.1.0.0.tar.gz.md5

MD5 hash values are also available for each file contained in an archive. These values are stored in a manifest file, also present within the archive.

For information on checking MD5 hash values, see MD5.
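
The archive-level check described above could look like the following minimal sketch. It assumes the .md5 file is plain text whose first whitespace-separated token is the hexadecimal digest (the layout produced by the common md5sum tool); that layout is an assumption, not something this page specifies. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Read the expected digest from an .md5 file.

    Assumption: the first whitespace-separated token is the hex digest.
    """
    tokens = Path(md5_path).read_text().split()
    if not tokens:
        raise ValueError(f"empty MD5 file: {md5_path}")
    return tokens[0].lower()


def archive_is_intact(archive_path):
    """Compare an archive against the .md5 file that sits next to it."""
    # The MD5 file name is the archive file name plus the .md5 extension.
    md5_path = str(archive_path) + ".md5"
    return md5_of_file(archive_path) == read_expected_md5(md5_path)
```

For example, archive_is_intact("broad.mit.edu_GBM.HT_HG-U133A.1.0.0.tar.gz") would return True only if the downloaded archive matches the digest in the accompanying .md5 file in the same directory.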

Notification of New Archives

When new archives are available at the DCC, a notification is sent to the TCGA-DATA-L mailing list. Subscribe to this mailing list to receive these notifications.

Data Freeze

A data freeze is the capture of publication data used and produced by an Analysis Working Group and the distribution of this information via a Publication page. A data freeze allows published results to be reproduced at a later date if desired.

  1. For disease marker papers (global analysis publications), the DCC is responsible for working with the AWG to put together a publication page that refers to all the published data in the publication.
  2. For disease follow-on papers where the TCGA Network is one of the authors, the authors have the option to ask the DCC to host a publication page referring to published DCC data. The DCC is not responsible for creating the publication page. Authors are responsible for creating a publication page using a DCC-supplied template. The DCC must be given two weeks' notice before this publication page is generated.
  3. For disease follow-on papers, where the TCGA Network is not one of the authors, the DCC is not involved with the publication page.

A data freeze is requested by an Analysis Working Group, and the DCC captures it in a freeze list. This is done so that working groups use a common set of data for calculations, analyses and findings.

When a new freeze list is available, notifications are sent to the following locations:
