A data archive is a compressed directory of experimental result files for a set of samples. An individual archive comes from a particular data submission center, for a specific platform and disease study. An experiment can be represented across multiple archives.
Data Archive Background
Each center transfers its data to the DCC in digital compressed file directories known as data archives. All archives from the same center type include common documents and follow distinct file structure and naming conventions. A TCGA experiment is likely to be represented across many archives.
The following concepts are particular to TCGA archives:
- Archives are flat; they do not contain subdirectories.
- Archive names follow a specific format.
- Archive contents are well-defined and include informational text files (like the manifest file) and data files.
- Archives are compressed before transfer.
- Each archive is accompanied by a corresponding MD5 file that is used to verify its integrity.
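The flat-archive rule can be checked mechanically after download. The following is a minimal sketch, not a DCC tool: it assumes the archive is a gzip-compressed tarball whose files sit either at the root or inside a single top-level directory, and the function names `nested_paths` and `nested_members` are hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def nested_paths(member_names):
    """Return the names that lie deeper than <top-level dir>/<file>.

    Assumption: a flat archive has at most one wrapping directory, so any
    path with more than two components points into a subdirectory.
    """
    return [name for name in member_names
            if len(PurePosixPath(name).parts) > 2]


def nested_members(archive_path):
    """List the members of a .tar.gz archive that break the flat layout."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return nested_paths(tar.getnames())
```

An empty result from `nested_members` suggests the archive is flat. An empty subdirectory would not be reported, because only paths deeper than two components are flagged.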
Archive Naming Convention
Archives are named using a specific convention in which the delimiters (underscore and period) are explicit and intentional. The first underscore in the name separates the center's domain name from the rest of the archive name, while periods separate the parts of the domain and the remaining fields of the archive name. Archive names are case-sensitive, as are the names of the files they contain.
Archive Name Format
<domain>_<disease study>.<platform>.<archive type>.<serial index>.<revision>.<series>
Label | Description |
---|---|
domain | The domain for a TCGA center is the Internet domain name associated with the submitting center's institution. Even if there is involvement from other centers, the domain reflects only the submitting center. |
disease study | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the project, a disease is referred to by its abbreviation. For example, glioblastoma multiforme is represented by the abbreviation GBM. |
platform | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or GCC. This is represented by a platform code. |
archive type | The archive type is the classification of a TCGA archive. For a Data Level Archive, this value is either 'Level_1', 'Level_2' or 'Level_3'. For a MAGE-TAB Archive, this value is 'mage-tab'. For an Auxiliary Archive, this value is 'aux'. |
serial index | Archives corresponding to the same <domain>_<disease study>.<platform> will have one and only one corresponding mage-tab archive. Conventionally, the serial index of the mage-tab archive is 1; however, the serial index is chosen by the submitting center. For other archive types, the serial index is a number that uniquely identifies an independent data set from a particular experiment. There is no overlap of data files between archives with different serial indexes. The numbering is entirely up to the data submission center. In general, BCRs use a serial index equivalent to a batch number, while other center types start their serial index series from 1. |
revision | A revision number can indicate the number of times an archive has been revised (starting from 0) and submitted to the DCC. However, the only requirement for revision numbers is that the revision number of the new archive must be higher than that of the archive being replaced. Files that have been changed or added are captured in the changes and additions files, respectively. |
series | This feature is currently disabled; the series number should always be 0. |
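The naming format above can be split into its fields with a regular expression. This is an illustrative sketch under stated assumptions rather than a DCC-provided parser: the names `ArchiveName`, `parse_archive_name` and `latest_revisions` are hypothetical, the archive name used in the usage note is made up to match the format, and grouping revisions by the first five fields is an assumption about which archive a new revision replaces.

```python
import re
from typing import NamedTuple, Optional


class ArchiveName(NamedTuple):
    domain: str
    disease_study: str
    platform: str
    archive_type: str
    serial_index: int
    revision: int
    series: int


# Archive type values listed in the table above (case-sensitive).
ARCHIVE_TYPES = ("Level_1", "Level_2", "Level_3", "mage-tab", "aux")

_PATTERN = re.compile(
    r"^(?P<domain>[^_]+)_"        # domain ends at the first underscore
    r"(?P<disease>[^.]+)\."       # disease abbreviation, e.g. GBM
    r"(?P<platform>.+)\."         # platform code; may contain '_' or '-'
    r"(?P<type>" + "|".join(re.escape(t) for t in ARCHIVE_TYPES) + r")\."
    r"(?P<serial>\d+)\.(?P<revision>\d+)\.(?P<series>\d+)$"
)


def parse_archive_name(name: str) -> Optional[ArchiveName]:
    """Split an archive name into its fields; return None if it does not fit."""
    # Drop the compression or MD5 extension if the caller passed a file name.
    for ext in (".tar.gz.md5", ".tar.gz"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ArchiveName(
        match["domain"], match["disease"], match["platform"], match["type"],
        int(match["serial"]), int(match["revision"]), int(match["series"]),
    )


def latest_revisions(names):
    """Keep the highest revision per (domain, disease, platform, type, serial).

    Assumption: a replacement archive shares those five fields with the
    archive it replaces and differs only in its revision number.
    """
    latest = {}
    for name in names:
        parsed = parse_archive_name(name)
        if parsed is None:
            continue
        key = parsed[:5]
        if key not in latest or parsed.revision > latest[key].revision:
            latest[key] = parsed
    return list(latest.values())
```

For a hypothetical name such as `broad.mit.edu_GBM.HT_HG-U133A.Level_1.1.0.0`, `parse_archive_name` yields the domain `broad.mit.edu`, the disease study `GBM`, the platform `HT_HG-U133A` and the archive type `Level_1`. Because matching is case-sensitive, the same name with `level_1` is rejected.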
File Types in an Archive
There are two types of files in an archive transferred to the DCC:
- Archive description files: An archive description file is an ASCII text file that contains information about the other data files within the same archive. It is named for its function, and the name is always capitalized. The following archive description files can be found in TCGA archives: MANIFEST.txt, DESCRIPTION.txt, README_DCC.txt and CHANGES_DCC.txt.
- Center-specific files: A center-specific file contains the data associated with an assay or experiment across various data levels.
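The two file types can be told apart by name alone. The sketch below is a hypothetical helper, and it assumes the four description file names listed above form the complete set; an archive may contain other description files that this section does not enumerate.

```python
# Description file names listed in this section; matching is case-sensitive,
# in line with the archive naming rules.
DESCRIPTION_FILES = frozenset({
    "MANIFEST.txt",
    "DESCRIPTION.txt",
    "README_DCC.txt",
    "CHANGES_DCC.txt",
})


def split_archive_files(file_names):
    """Return (description_files, center_specific_files), preserving order."""
    description, center_specific = [], []
    for name in file_names:
        if name in DESCRIPTION_FILES:
            description.append(name)
        else:
            center_specific.append(name)
    return description, center_specific
```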
Accessing Archives
Archives submitted to the DCC are validated against standardized formats before being distributed as both downloadable files and application-accessible data. The research community can search and access the data through the following methods and applications (all of which are accessible via the Data Portal):
- Data Matrix
- Bulk download
- TCGA File Search
- HTTP data access
- Latest Archive Report
- Experiment Aliquot Report
All of these except HTTP data access can search and filter archives by parameters such as cancer type, center, platform, data type and submission date.
Ensuring Data Integrity
According to best practice, users should ensure that each file downloaded from TCGA has not been corrupted during file transfer. This is especially important for very large archives. MD5 hash files are available to easily confirm the integrity of archived files. A TCGA archive is always accompanied by a corresponding MD5 hash file. Archive and MD5 hash file names differ only in their extensions, as in the following example.
Archive | MD5 file |
---|---|
broad.mit.edu_GBM.HT_HG-U133A.1.0.0.tar.gz | broad.mit.edu_GBM.HT_HG-U133A.1.0.0.tar.gz.md5 |
MD5 hash values are also available for each file contained in an archive. These values are stored in a manifest file, also present within the archive.
For information on checking MD5 hash values, see MD5.
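Both integrity checks described in this section can be scripted with Python's standard `hashlib` module. The following is a minimal sketch, not an official DCC utility. It assumes that the `.md5` file and the manifest use the common `md5sum` layout (a hex digest followed by whitespace and a file name) and that the manifest is named MANIFEST.txt; neither format is specified here, so adjust the parsing if the actual files differ. All function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_md5(md5_path):
    """Assumption: the first whitespace-separated token is the hex digest."""
    tokens = Path(md5_path).read_text().split()
    return tokens[0].lower() if tokens else ""


def archive_is_intact(archive_path):
    """Compare an archive with the .md5 file that sits next to it."""
    archive_path = Path(archive_path)
    md5_path = archive_path.with_name(archive_path.name + ".md5")
    return md5_of_file(archive_path) == read_expected_md5(md5_path)


def corrupted_manifest_entries(archive_dir, manifest_name="MANIFEST.txt"):
    """Return the files whose MD5 differs from the manifest, or that are missing.

    Assumption: each manifest line reads "<hex digest> <file name>".
    """
    archive_dir = Path(archive_dir)
    bad = []
    for line in (archive_dir / manifest_name).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, file_name = parts[0].lower(), parts[1].strip().lstrip("*")
        target = archive_dir / file_name
        if not target.is_file() or md5_of_file(target) != expected:
            bad.append(file_name)
    return bad
```

Reading in chunks keeps memory use constant, which matters for the very large archives mentioned above.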
Notification of New Archives
When new archives are available at the DCC, a notification is sent to the TCGA-DATA-L mailing list. Subscribe to this mailing list to receive these notifications.
Data Freeze
A data freeze is the capture of publication data used and produced by an Analysis Working Group and the distribution of this information via a Publication page. A data freeze allows published results to be reproduced at a later date if desired.
- For disease marker papers (a global analysis publication), the DCC is responsible for working with the AWG to put together a publication page referring to all the published data in the publication.
- For disease follow-on papers, where the TCGA Network is one of the authors, the authors have the option to ask the DCC to host a publication page referring to published DCC data. The DCC is not responsible for creating the publication page. Authors are responsible for creating a publication page using a DCC-supplied template. The DCC must be notified two weeks before this publication page is generated.
- For disease follow-on papers, where the TCGA Network is not one of the authors, the DCC is not involved with the publication page.
When a new freeze list is available, notifications are sent to the following locations:
- TCGA-DATA-L mailing list
- TCGA-GDAC-L mailing list
- TCGA_ANALYSIS_WG mailing list
- Announcements on TCGA Data Portal