Document Information

GSC MAGE-TAB Specification
Version 1.0
November 13, 2013

Background

TCGA Genome Sequencing Centers (GSCs) typically generate sequence-based data that are submitted to the DCC in the form of mutation reports (e.g., MAF, VCF files). While VCF files allow for capturing some details about the experimental process associated with the data, MAF files do not contain adequate metadata. End users do not have a complete perspective of the data generation process that would allow them to interpret or replicate the results.

Users usually want to be able to view a snapshot of the mutation calling process to understand the data. Some parameters are included in BAM metadata (e.g., Library_Strategy) and can be accessed at CGHub. A few sample questions that MAGE-TAB can easily provide answers to are listed below. Some of these are based on real user queries received by the DCC.

  • What is the pipeline (and its arguments) that produces the content of each MAF file from the BAM files?
  • How can I easily access details about the sequence library used?
  • What annotation file is being used?
  • Who should I contact to find more details about a specific experimental protocol associated with the sequencing data deposited at the DCC?
  • Are there any publications that describe this data set?

Purpose

MAGE-TAB provides a common platform for sharing experimental details and relationships between data files. This document will serve as a specification for how MAGE-TAB files for sequencing data should be generated. Examples are provided when possible. However, each dataset can have its own unique set of constraints and features so the files should be modeled to clearly depict an experimental process while conforming to the core specification.

MAGE-TAB

Refer to the MAGE-TAB TCGA encyclopaedia page for details. The official MAGE-TAB specification is available.

TCGA uses the MAGE-TAB standard to capture experiment details and the relationships between related data files (i.e., the data files produced at successive stages as protocols are applied to the sample data). MAGE-TAB files are tab-delimited text files that model data as rows and columns and can capture complex experimental relationships, such as an entire study that uses multiple assays.

The MAGE-TAB format uses two different types of files to capture information about an experiment:

File Type | File Extension | Platform | Description | Mandatory File?
Investigation Description Format (IDF) | .idf | mage-tab | Provides general information about the investigation including its name, a brief description, the investigator's contact details, bibliographic references, and free text descriptions of the protocols used in the investigation. | yes
Sample and Data Relationship Format (SDRF) | .sdrf | mage-tab | Describes the relationships between samples, sequencing platforms, data, and other objects used or produced in the investigation. In TCGA SDRF files, a row represents an analyzed element (often an aliquot) in its most basic electronic form (i.e. raw data file) and the production of higher-level data files (Level 2 and 3) as protocols (e.g. variant calling) are applied to the file and its derivatives. These protocols correspond to those listed in the IDF. | yes

The following figure depicts the association among different files in a MAGE-TAB archive.

[Figure: gsc_idf_sdrf]

MAGE-TAB for sequencing assays

The purpose of this proposal is to use MAGE-TAB to capture the experimental and data transformation steps involved in a typical sequencing assay. Clearly formatted information about the experiment would enable end users to replicate the experimental results and interpret the data in light of the experimental design. The following figure illustrates the experimental and bioinformatics steps, with their associated attributes, that MAGE-TAB can capture to describe the entire experiment: material extraction, library preparation, sequencing, mapping, variant calling, and final validation. Each column of the SDRF can provide experimental details or pointers to additional information that help users determine how an experiment was done and which files contain the experimental results.

[Figure: seqWorkflow]

BAM metadata at CGHub

CGHub hosts metadata associated with BAM files. Centers deposit XML files containing metadata as part of BAM submission. To avoid redundant representation of information across multiple results files, the CGHub metadata reference will be provided in the SDRF so that the user can access the most current metadata from a single authoritative source. The CGHub analysis ID indicated in Comment [TCGA CGHub ID] in the SDRF can be used to view the metadata XML. Comment [TCGA CGHub metadata URL] provides a link to the latest XML file. The following attributes can be directly accessed from this XML:

  • Library attributes (strategy, source, selection, layout)
  • Sequencing attributes (instrument, algorithm)
  • Mapping attributes (algorithm, BAM file name, QC method)

Processes downstream of variant calling will be described in the SDRF since CGHub metadata may not extend to variant calling.

New GSC MAGE-TAB fields

Discussions with the GSCs have indicated the need for several fields that are not in the existing GCC MAGE-TAB specification. In keeping with the MAGE-TAB specification, these additional fields will be listed in the IDF file as Protocol Parameters, in a semicolon-delimited list, and the actual values for those fields will be in the SDRF file under Parameter Value [<parameter field>] column headers.

The fields that will be added are:

Field | Description
Parameter Value [Vendor] | The name of the vendor supplying the protocol
Parameter Value [Catalog Name] | The catalog name identifying the protocol
Parameter Value [Catalog Number] | The catalog number that identifies the protocol and kit used
Parameter Value [Product URL] | A vendor specific URL that points to the product
Parameter Value [Annotation URL] | A URL pointing to the annotation used to develop the protocol
Parameter Value [Target Reference Accession] | Accession number of the reference genome
Parameter Value [Target File URL] | A URL pointing to a file containing a list of targets
Parameter Value [Target File Format] | The format of the target file referenced in Target File URL
Parameter Value [Target File Format Version] | The format version of the format referenced in Target File Format
Parameter Value [Probe File URL] | A URL pointing to a file containing a list of the probes that should capture the targets
Parameter Value [Probe File Format] | The format of the file indicated by the Probe File URL
Parameter Value [Probe File Format Version] | The version of the format indicated in Probe File Format
Parameter Value [Protocol Min Base Quality] | The minimum quality value (Phred score) for a base to be counted in the coverage statistics
Parameter Value [Protocol Min Map Quality] | The minimum mapping quality (Phred score) for a read to be used in mapping
Parameter Value [Protocol Min Tumor Coverage] | The minimum coverage for coverage determination and variant calling in tumor samples
Parameter Value [Protocol Min Normal Coverage] | The minimum coverage for coverage determination and variant calling in normal samples
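
As a minimal sketch of the IDF-to-SDRF naming relationship described above, the following Python turns one semicolon-delimited Protocol Parameters value into the matching SDRF column headers. The function name `sdrf_parameter_headers` is hypothetical and is not part of any DCC tool.

```python
def sdrf_parameter_headers(idf_protocol_parameters: str) -> list[str]:
    """Turn one IDF Protocol Parameters value into SDRF column headers."""
    # Parameter names are separated by semicolons in the IDF.
    names = [name.strip() for name in idf_protocol_parameters.split(";")]
    # Each name becomes a "Parameter Value [<name>]" header in the SDRF.
    return [f"Parameter Value [{name}]" for name in names if name]


headers = sdrf_parameter_headers("Protocol Min Base Quality;Protocol Min Map Quality")
print(headers)
```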

Additional fields for discussion

  • Comment[TCGA Sequence Target File] - Will not be used at the request of the GSC working group
  • Annotation REF -  The DCC sees a need for this field as we are under direction to promote a standard set of annotation.
  • Several of the fields are suggested to be required for capture protocol. Is there a defined set of protocols that will be submitted?

SDRF specification

The following table provides a description of column headers that should be included in a next generation sequencing SDRF.

Where data is not available, please use "->" as a placeholder to indicate that the field is empty.  Added by TDP 11/7/2013
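
The sketch below shows one way an SDRF could be read in Python so that "->" cells are treated as empty. It is an illustration only: the function name `read_sdrf` and the sample content are assumptions. Rows are kept as lists of (header, value) pairs because headers such as Protocol REF and Derived Data File occur more than once in a row, so a plain dictionary would lose columns.

```python
import csv
import io

EMPTY = "->"  # placeholder the specification uses for unavailable data


def read_sdrf(handle):
    """Return (headers, rows); each row is a list of (header, value) pairs."""
    reader = csv.reader(handle, delimiter="\t")
    headers = [h.strip() for h in next(reader)]
    rows = []
    for record in reader:
        # Map the placeholder to None so callers can test for missing data.
        values = [None if v.strip() == EMPTY else v.strip() for v in record]
        rows.append(list(zip(headers, values)))
    return headers, rows


sample = (
    "Extract Name\tComment [TCGA Barcode]\tComment [TCGA CGHub ID]\n"
    "151626d0-a650-447a-a2b0-1df53c31c682\tTCGA-A1-A0SD-01A-11R-A114-13\t->\n"
)
headers, rows = read_sdrf(io.StringIO(sample))
print(rows[0])
```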

SDRF fields

Category | Column name | Description | Sample value | Required | Allowed values (if applicable)
Material | Extract Name | TCGA aliquot UUID | 151626d0-a650-447a-a2b0-1df53c31c682 | yes | 
Material | Comment [TCGA Barcode] | TCGA aliquot barcode | TCGA-A1-A0SD-01A-11R-A114-13 | yes | Aliquot barcode
Material | Comment [is tumor] | Whether aliquot is from tumor sample or not | yes | yes | yes, no
Material | Material Type | Controlled terms for state of the material | DNA, Total RNA | yes | 
Material | Annotation REF | URL pointing to an annotation file (e.g., GAF) | https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF3.0beta/TCGA.GRCh37-lite.June2012.3.0.gaf.gz | no | URL
Material | Comment [TCGA Genome Reference] | Identifier for reference genome used | GRCh37-lite | yes | hg18, hg19, GRCh37, GRCh37-lite, 36, 36.1, 37
Library preparation | Protocol REF | Token referring to protocol for library construction in IDF (must be formatted as domain:protocol:platform:version) | hgsc.bcm.edu:library_preparation:IlluminaGA:01 | yes | Must follow the format <domain>:library_preparation:<platform>:<version>. The DCC Validator will check for the GSC specific columns if the protocol = library_preparation. This follows the GCC standard.
Library preparation | Parameter Value [Vendor] | Vendor of the reagents used | Agilent | yes | Conditionally Required (Required for commercial capture reagents, Optional for WGS and custom capture.)
Library preparation | Parameter Value [Catalog Name] | Catalog name for the reagent used | Nimblegen SeqCap EZ Exome Library v3.0 | yes | Conditionally Required (Required for commercial capture reagents, Optional for WGS and custom capture.)
Library preparation | Parameter Value [Catalog Number] | Capture reagent catalog number | 06465692001 | yes | Conditionally Required (Required for commercial capture reagents, Optional for WGS, custom capture, and data predating)
Library preparation | Parameter Value [Annotation URL] | URL to a source of the annotation | http://www.nimblegen.com/downloads/annotation/ez_exome_v3/SeqCapEZ_Exome_v3.0_Design_Annotation_files.zip | no | URL
Library preparation | Parameter Value [Product URL] | URL to the capture reagent used | http://www.nimblegen.com/products/seqcap/ez/v3/index.html | no | URL required for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Target File URL] | URL to the capture reagent target file | http://www.nimblegen.com/downloads/annotation/ez_baylor_vcrome/baylor_vcrome_design_110323.zip | no | URL or "Proprietary" for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Target File Format] | Format of the target file | BED | no | Format name for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Target File Format Version] | Version of the target file | http://genome.ucsc.edu/FAQ/FAQformat.html#format1 | no | Version for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Probe File URL] | URL pointing to the probe file | http://www.nimblegen.com/downloads/annotation/ez_baylor_vcrome/baylor_vcrome_design_110323.zip | no | URL or "Proprietary" for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Probe File Format] | Format of the Probe file | FASTA | no | Format name for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Probe File Format Version] | Version of the probe file | http://fasta_somewhere.org | no | URL for capture protocol, "NA" otherwise
Library preparation | Parameter Value [Target Reference Accession] | Accession of the reference genome in Comment [TCGA Genome Reference] | GCF_000001405.12 | no | 
Sequencing | Protocol REF | Protocol for nucleic acid sequencing. Token referring to IDF description | hgsc.bcm.edu:DNA_sequencing:IlluminaGA:01 | no | 
Mapping | Protocol REF | Protocol for sequence alignment | hgsc.bcm.edu:alignment:IlluminaGA:01 | yes | 
Mapping | Comment [Derived Data File REF] | Aggregated BAM file; not in the data archive | TCGA-A1-A0SD-01A-11R-A114-13_Illumina.bam | yes | 
Mapping | Comment [TCGA CGHub ID] | CGHub ID if BAM file has been deposited, -> otherwise | 251827d0-a650-447a-a2b0-1df73c31c682 | yes | 
Mapping | Comment [TCGA CGHub metadata URL] | URL pointing to BAM metadata XML deposited at CGHub | https://cghub.ucsc.edu/cghub/metadata/analysisAttributes/251827d0-a650-447a-a2b0-1df73c31c682 | no | 
Mapping | Comment [TCGA Include for Analysis] | Flag to indicate if BAM file passed QC and is included for analysis | yes | yes | yes, no
Mapping | Derived Data File | File containing coverage data (e.g., wig) | TCGA-A1-A0SD.wig | no | 
Mapping | Comment [TCGA Include for Analysis] |  | yes | yes (only if preceding file is defined) | 
Mapping | Comment [TCGA Data Type] |  | Quantitative-Coverage | yes (only if preceding file is defined) | 
Mapping | Comment [TCGA Data Level] |  | Level 2 | yes (only if preceding file is defined) | 
Mapping | Comment [TCGA Archive Name] |  | hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.Level_2.1.0.0 | yes (only if preceding file is defined) | 
Mapping | Parameter Value [Protocol Min Base Quality] |  | 20 | yes (only if preceding file is defined) | Must be an integer value
Mapping | Parameter Value [Protocol Min Map Quality] |  | 30 | yes (only if preceding file is defined) | Must be an integer value
Mapping | Parameter Value [Protocol Min Tumor Coverage] |  | 30 | yes (only if preceding file is defined) | Must be an integer value
Mapping | Parameter Value [Protocol Min Normal Coverage] |  | 30 | yes (only if preceding file is defined) | Must be an integer value
Variant calling | Protocol REF | Protocol for variant calling | hgsc.bcm.edu:variant_calling::01 | yes | 
Variant calling | Derived Data File | File containing mutation calls (e.g., VCF or MAF) | TCGA-A1-A0SD.vcf | yes | must allow multiple files (use comma if a separator is needed)
Variant calling | Comment [TCGA Spec Version] | TCGA specification version the file complies with (if applicable) |  |  | must allow multiple files (use comma if a separator is needed)
Variant calling | Comment [TCGA Include for Analysis] |  | yes | yes | yes, no
Variant calling | Comment [TCGA Data Type] |  | Mutations | yes | 
Variant calling | Comment [TCGA Data Level] |  | Level 2 | yes | Level 2
Variant calling | Comment [TCGA Archive Name] | DCC archive name (without .tar.gz) | hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.Level_2.1.0.0 | yes | 
MAF generation | Protocol REF | Protocol for mutation filtering and annotation | hgsc.bcm.edu:vcf2maf::01 | yes | 
MAF generation | Derived Data File | File with filtered and annotated mutation calls (usually MAF) | hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.1.maf | no | 
MAF generation | Comment [TCGA Spec Version] |  | 2.3 | yes (only if preceding file is defined) | 
MAF generation | Comment [TCGA Include for Analysis] |  | yes | yes (only if preceding file is defined) | 
MAF generation | Comment [TCGA Data Type] |  | Mutations | yes (only if preceding file is defined) | 
MAF generation | Comment [TCGA Data Level] |  | Level 2 | yes (only if preceding file is defined) | Level 2
MAF generation | Comment [TCGA Archive Name] |  | hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.Level_2.1.0.0 | yes (only if preceding file is defined) | 
Mutation validation | Protocol REF | Protocol for mutation validation | hgsc.bcm.edu:validation::01 | no | 
Mutation validation | Derived Data File | File containing experimentally validated mutation calls | hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.1.somatic.maf | no | 
Mutation validation | Comment [TCGA Spec Version] |  | 2.3 | no | 
Mutation validation | Comment [TCGA Include for Analysis] |  | yes | yes (only if preceding file is defined) | 
Mutation validation | Comment [TCGA Data Type] |  | Mutations-Somatic | yes (only if preceding file is defined) | 
Mutation validation | Comment [TCGA Data Level] |  | Level 2 | yes (only if preceding file is defined) | Level 2
Mutation validation | Comment [TCGA Archive Name] |  | hgsc.bcm.edu_COAD.SOLiD_DNASeq.Level_2.1.0.0 | yes (only if preceding file is defined) | 

SDRF validation

General validation rules

  1. If any column headers are not in the 'allowed headers' list for the corresponding experiment type, then FAIL.
  2. If any row of the file has fewer tab-delimited values than the header row, then FAIL.
  3. If any required column headers are missing, then FAIL.
  4. If any columns with headers ending with 'File' (henceforth known as 'File columns') are missing any required Comment columns (see below), then FAIL.
  5. If any row contains non-blank values after a 'No' for 'Comment [TCGA Include for Analysis]' then FAIL.
  6. If value for 'Comment [TCGA Data Level]' column is not in format 'Level N' where N is a valid number, then FAIL.
  7. If the value for 'Comment [TCGA Include for Analysis]' column is not 'yes' or 'no' (case ignored) then FAIL.
  8. If the value for 'Comment [TCGA Archive Name]' column is not a valid archive name, then FAIL.
  9. If the level of an archive (as parsed from TCGA Archive Name column) does not match the value of the Data Level column (replacing '_' with ' '), then FAIL.
  10. If any value for 'Term Source REF' is not represented in the IDF file's Term Source Name row, then FAIL.
  11. If the Extract Name value is not a valid aliquot UUID then FAIL.
  12. If the Extract Name value biospecimen does not belong to the disease set for the archive's disease type, then FAIL.
  13. Protocol REF should be in the format domain:protocol:platform:version (e.g., hgsc.bcm.edu:library_preparation:IlluminaGA:01).
  14. The ID used for Protocol REF must be defined in the IDF as a Protocol Name.
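
A few of the rules above (2, 6, 7, and 13) can be sketched per row as follows. This is an illustrative sketch, not the DCC Validator: the function name `check_sdrf_row` is hypothetical, the regular expressions are assumptions about what counts as well formed, the platform part of Protocol REF is allowed to be empty because sample values such as hgsc.bcm.edu:variant_calling::01 omit it, and cells holding the "->" placeholder are assumed to be exempt from format checks.

```python
import re

PLACEHOLDER = "->"
# domain:protocol:platform:version, with an optionally empty platform part.
PROTOCOL_REF = re.compile(r"^[^:\s]+:[^:\s]+:[^:\s]*:[^:\s]+$")
DATA_LEVEL = re.compile(r"^Level \d+$")


def check_sdrf_row(headers, values):
    """Return a list of rule violations for one SDRF row (empty list = pass)."""
    errors = []
    if len(values) < len(headers):  # rule 2
        errors.append("row has fewer values than the header row")
    for header, value in zip(headers, values):
        if value == PLACEHOLDER:
            continue  # assumption: placeholder cells are not format-checked
        if header == "Comment [TCGA Data Level]":
            if not DATA_LEVEL.match(value):  # rule 6
                errors.append(f"bad data level: {value!r}")
        elif header == "Comment [TCGA Include for Analysis]":
            if value.lower() not in ("yes", "no"):  # rule 7, case ignored
                errors.append(f"bad include flag: {value!r}")
        elif header == "Protocol REF":
            if not PROTOCOL_REF.match(value):  # rule 13
                errors.append(f"bad Protocol REF: {value!r}")
    return errors


headers = ["Protocol REF", "Comment [TCGA Include for Analysis]", "Comment [TCGA Data Level]"]
good = check_sdrf_row(headers, ["hgsc.bcm.edu:variant_calling::01", "Yes", "Level 2"])
bad = check_sdrf_row(headers, ["variant_calling", "maybe", "L2"])
print(good, bad)
```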

GSC-specific validation rules

  1. Annotation REF should be a valid URL pointing to the annotation file (GAF) being used.  
  2. Comment [TCGA Genome Reference] should be a valid reference genome identifier.
  3. Comment [TCGA CGHub ID] should be a valid CGHub analysis ID if BAM file has been deposited at CGHub and should be -> otherwise.
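
The three GSC-specific checks might look like the sketch below. Several points are assumptions rather than part of this specification: the function names are hypothetical, a "valid URL" is taken to mean any value with a scheme and a host, a CGHub analysis ID is assumed to be a UUID because the sample value has that shape, and the optional Annotation REF is allowed to hold the "->" placeholder. The set of genome references is copied from the allowed values in the SDRF table.

```python
import uuid
from urllib.parse import urlparse

GENOME_REFERENCES = {"hg18", "hg19", "GRCh37", "GRCh37-lite", "36", "36.1", "37"}


def is_url(value):
    """Loose URL check: requires a scheme and a network location."""
    parts = urlparse(value)
    return bool(parts.scheme) and bool(parts.netloc)


def is_uuid(value):
    """True if value is a UUID in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def check_gsc_fields(annotation_ref, genome_reference, cghub_id):
    errors = []
    if annotation_ref != "->" and not is_url(annotation_ref):
        errors.append("Annotation REF is not a URL")
    if genome_reference not in GENOME_REFERENCES:
        errors.append("unknown genome reference")
    if cghub_id != "->" and not is_uuid(cghub_id):
        errors.append("CGHub ID is neither a UUID nor '->'")
    return errors


ok = check_gsc_fields(
    "https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF3.0beta/TCGA.GRCh37-lite.June2012.3.0.gaf.gz",
    "GRCh37-lite",
    "251827d0-a650-447a-a2b0-1df73c31c682",
)
print(ok)
```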

Additional validation rules to be added here based on discussion with GSCs.

IDF specification

The following table provides a description of fields that should be included in sequencing IDF.

IDF fields

Field | Description | Sample Value
Investigation Title | Title of the investigation | DNA sequencing analysis of TCGA OV Samples using Illumina HiSeq
Experimental Design | Experimental design types applicable to the study. Typically the terms should be obtained from an ontology. | individual_genetic_characteristics_design
Experimental Design Term Source REF | The source of the Experimental Design terms; this must reference one of the Term Source Names defined elsewhere in the IDF file | MGED Ontology
Experimental Factor Name | A user-defined name for each experimental factor studied by the experiment | genotype
Experimental Factor Type | A term describing the type of each experimental factor. These terms will usually come from an ontology. | disease_state
Person Last Name | The last name of each person associated with the experiment | 
Person First Name | The first name of each person associated with the experiment | 
Person Middle Initials | The middle initials of each person associated with the experiment | 
Person Email | The email address of each person associated with the experiment | 
Person Address | The street address of each person associated with the experiment | 
Person Affiliation | The organization affiliation for each person associated with the experiment | 
Person Roles | The role(s) performed by each person. Typically these terms should come from the MGED Ontology. | investigator;submitter
PubMed ID | The PubMed IDs of the publication(s) associated with this investigation (where available) | 123456
Publication Author List | The list of authors associated with each publication | Doe, J., Shakespeare, W.
Publication Title | The title of each publication | 
Publication Status | A term describing the status of each publication (e.g. "submitted", "in preparation", "published") | 
Experiment Description | A short paragraph describing the experiment as free-text. This tag can only have one value. | 
Protocol Name | The names of the protocols used within the MAGE-TAB document. These will be referenced in the SDRF in the "Protocol REF" columns. Must be in domain:protocol:platform:version format. | hgsc.bcm.edu:DNA_extraction:IlluminaGA_DNASeq:01
Protocol Type | The type of the protocol, taken from a controlled vocabulary | library preparation
Protocol Description | A free-text description of the protocol | PCR with sequencing primers, size fractionation
Protocol Parameters | A semicolon-delimited list of parameter names; these names are used in the SDRF file (as "Parameter Value [<parameter name>]" headings) to list the values used for each protocol parameter. | Protocol Min Base Quality;Protocol Min Map Quality
Protocol Term Source REF | The source of the Protocol Type terms; this must reference one of the Term Source Names defined elsewhere in the IDF file. | Sequence Ontology
SDRF File | The name(s) of the SDRF file(s) accompanying this IDF file | tcga_coad_sdrf.txt
Term Source Name | The names of the Term Sources (ontologies or databases) used within the MAGE-TAB document | Sequence Ontology
Term Source File | A filename or valid URI at which the Term Source may be accessed | http://song.cvs.sourceforge.net/song/ontology/
Term Source Version | The version of the Term Source used throughout the MAGE-TAB document | 1.328

IDF validation

General validation rules

  1. If the IDF is blank then FAIL.
  2. If any row headers are not in the Allowed Headers list then FAIL.
  3. If the 'Protocol Name' header is missing then FAIL.
  4. If 'Protocol Name' value does not have the valid format <domain>:<protocolType>:<platform>:<version> then FAIL.
  5. If the 'Protocol Description' header is missing then FAIL.
  6. If the first value for any header is "->" (indicating a blank) then FAIL.
  7. If the 'Term Source Name' header is present, then the number of values for 'Term Source Name', 'Term Source File', and 'Term Source Version' must be the same, otherwise FAIL.
  8. If any SDRF column header contains a 'Term Source REF' value that is not represented under the IDF "Term Source Name" header then FAIL.
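
Rules 1 and 3 through 7 can be sketched against a parsed IDF as shown below. This is an illustration under stated assumptions, not the DCC Validator: `parse_idf` and `check_idf` are hypothetical names, the IDF is assumed to hold one tab-delimited row per header with the header in the first cell, and the Protocol Name pattern tolerates an empty platform part because the SDRF sample values include such tokens.

```python
import re

PROTOCOL_NAME = re.compile(r"^[^:\s]+:[^:\s]+:[^:\s]*:[^:\s]+$")
TERM_SOURCE_ROWS = ("Term Source Name", "Term Source File", "Term Source Version")


def parse_idf(text):
    """Map each IDF row header to its list of values (trailing blanks dropped)."""
    idf = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        header, *values = [cell.strip() for cell in line.split("\t")]
        while values and not values[-1]:
            values.pop()
        idf[header] = values
    return idf


def check_idf(idf):
    if not idf:  # rule 1
        return ["IDF is blank"]
    errors = []
    for required in ("Protocol Name", "Protocol Description"):  # rules 3 and 5
        if required not in idf:
            errors.append(f"missing header: {required}")
    for name in idf.get("Protocol Name", []):  # rule 4
        if not PROTOCOL_NAME.match(name):
            errors.append(f"bad Protocol Name: {name!r}")
    for header, values in idf.items():  # rule 6
        if values and values[0] == "->":
            errors.append(f"first value of {header!r} is blank")
    if "Term Source Name" in idf:  # rule 7
        if len({len(idf.get(row, [])) for row in TERM_SOURCE_ROWS}) != 1:
            errors.append("Term Source Name/File/Version counts differ")
    return errors


good_idf = parse_idf(
    "Protocol Name\thgsc.bcm.edu:library_preparation:IlluminaGA:01\n"
    "Protocol Description\tPCR with sequencing primers, size fractionation\n"
    "Term Source Name\tSequence Ontology\n"
    "Term Source File\thttp://song.cvs.sourceforge.net/song/ontology/\n"
    "Term Source Version\t1.328\n"
)
print(check_idf(good_idf))
```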

GSC-specific validation rules

  1. IDs used for Protocol REF in SDRF must be defined in IDF under Protocol Name.
  2. Parameter Values associated with a specific Protocol REF in the SDRF must be defined in the IDF as a semicolon-delimited list under the Protocol Parameters field for that protocol.
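
The two cross-file rules above could be checked roughly as follows. The sketch rests on two assumptions that this document does not spell out: the IDF Protocol Name and Protocol Parameters rows are aligned by position, and each Parameter Value column in the SDRF belongs to the nearest Protocol REF column to its left. The function name `cross_check` is hypothetical.

```python
import re
from itertools import zip_longest

PARAMETER_HEADER = re.compile(r"^Parameter Value\s*\[(.+)\]$")


def cross_check(protocol_names, protocol_parameters, sdrf_headers, sdrf_values):
    """Check one SDRF row against the protocols declared in the IDF."""
    # Build {protocol name: set of declared parameter names} from the IDF rows.
    declared = {}
    for name, params in zip_longest(protocol_names, protocol_parameters, fillvalue=""):
        declared[name] = {p.strip() for p in params.split(";") if p.strip()}

    errors = []
    current = None  # the protocol that the following parameter columns belong to
    for header, value in zip(sdrf_headers, sdrf_values):
        header = header.strip()
        if header == "Protocol REF":
            current = value
            if value != "->" and value not in declared:
                errors.append(f"Protocol REF {value!r} is not an IDF Protocol Name")
            continue
        match = PARAMETER_HEADER.match(header)
        if match and current in declared:
            parameter = match.group(1).strip()
            if parameter not in declared[current]:
                errors.append(f"{parameter!r} is not declared for {current!r}")
    return errors


names = ["hgsc.bcm.edu:alignment:IlluminaGA:01"]
parameters = ["Protocol Min Base Quality;Protocol Min Map Quality"]
sdrf_headers = [
    "Protocol REF",
    "Parameter Value [Protocol Min Base Quality]",
    "Parameter Value [Vendor]",
    "Protocol REF",
]
sdrf_values = ["hgsc.bcm.edu:alignment:IlluminaGA:01", "20", "Agilent", "hgsc.bcm.edu:variant_calling::01"]
problems = cross_check(names, parameters, sdrf_headers, sdrf_values)
print(problems)
```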

Experimental protocols defined in IDF and referenced in SDRF

  • Please note that each Protocol Type should be included in your IDF. However, Protocol Name and Protocol Parameters should be stated as they apply to your specific protocol and should be referenced in the SDRF.
  • It is important that IDF/SDRF are updated and resubmitted whenever a protocol is revised, added, or removed. This would ensure that users have accurate experimental details available.
  • Protocol Description should be modified to depict your experiment. If a URL is available for protocol description (e.g., official webpage for a base-calling algorithm), assign it to the Protocol Description field so that users can directly refer to the page.
  • A semicolon-delimited list of parameter names is defined in Protocol Parameters. These are used in the SDRF file (as "Parameter Value [<parameter name>]" headings) to list the values used for each protocol parameter. If more than one parameter was used for a given protocol, they should be separated with semicolons (";").

TCGA MAGE-TAB Archive Requirements

Archive Names

TCGA MAGE-TAB archive names follow a common naming convention:

<domain>_<disease study>.<platform>.<archive type>.<serial index>.<revision>.<series>

Label

Description

domain

The domain for a TCGA center is the Internet domain name associated with the submitting center's institution. Even if there is involvement from other centers, the domain reflects only the submitting center.

For example, broad.mit.edu is the domain for the Broad Institute at MIT, and mskcc.org is the domain for Memorial Sloan-Kettering Cancer Center.

disease study

disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the project, a disease is referred to by its abbreviation. For example, Glioblastoma multiforme is represented by the abbreviation GBM.

A complete list of disease studies and their abbreviations is found in the Code Tables Report.

platform

platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or GCC. This is represented by a platform code.

For a complete list of platform codes, see the column "Platform Alias" in the platforms code report.

archive type

The archive type is the classification of a TCGA archive. For a MAGE-TAB Archive, this value is 'mage-tab'.

serial index

Archives corresponding to the same <domain>_<disease study>.<platform> will have one and only one corresponding mage-tab archive. Conventionally, the serial index of the mage-tab archive is 1; however, the serial index is chosen by the submitting center.

For other archive types, the serial index is a number that uniquely identifies an independent data set from a particular experiment. There is no overlap of data files between archives of differing serial numbers. The numbering is entirely up to the data submission center. In general, BCRs use a serial index equivalent to a batch number while other center types start serial index series from 1.

revision number

A revision number can indicate the number of times an archive has been revised (starting from 0) and submitted to the DCC. However, the only requirement for revision numbers is that the revision number of the new archive is to be higher than that of the archive being replaced. Files that have been changed or added are captured in the changes and additions files, respectively.

series number

This feature is currently disabled; the series number should always be 0.
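
The naming convention can be illustrated with a small parser. This is a sketch only: the regular expression and the function names are assumptions, and it relies on the domain containing no underscore and on the remaining parts containing no period. The second helper mirrors SDRF validation rule 9, which compares the archive level (with '_' replaced by ' ') to the Data Level column.

```python
import re

ARCHIVE_NAME = re.compile(
    r"^(?P<domain>[^_]+)_(?P<disease_study>[^.]+)\.(?P<platform>[^.]+)"
    r"\.(?P<archive_type>[^.]+)\.(?P<serial_index>\d+)\.(?P<revision>\d+)\.(?P<series>\d+)$"
)


def parse_archive_name(name):
    """Split an archive name into its labelled parts, or return None."""
    match = ARCHIVE_NAME.match(name)
    return match.groupdict() if match else None


def archive_data_level(name):
    """Return e.g. 'Level 2' for a Level_2 archive, None for other types."""
    parts = parse_archive_name(name)
    if parts and parts["archive_type"].startswith("Level_"):
        return parts["archive_type"].replace("_", " ")
    return None


parts = parse_archive_name("hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.Level_2.1.0.0")
print(parts)
print(archive_data_level("hgsc.bcm.edu_COAD.IlluminaGA_DNASeq.Level_2.1.0.0"))
```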

Archive Contents

MAGE-TAB archives must contain:

  • An IDF file (ending in .idf.txt)
  • An SDRF file (ending in .sdrf.txt)
  • A MANIFEST.txt file containing MD5 values for each file.
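
A MANIFEST.txt could be generated along the following lines. The exact line layout is not defined in this document; the sketch assumes one "<md5>  <file name>" line per file, as produced by the common md5sum tool, and the function name `write_manifest` and the example file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_manifest(archive_dir):
    """Write MANIFEST.txt with one MD5 line per file and return the lines."""
    archive_dir = Path(archive_dir)
    lines = []
    for path in sorted(archive_dir.iterdir()):
        # Skip subdirectories and the manifest itself.
        if path.is_file() and path.name != "MANIFEST.txt":
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.name}")
    (archive_dir / "MANIFEST.txt").write_text("\n".join(lines) + "\n")
    return lines


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example.idf.txt").write_text("Investigation Title\tExample\n")
    Path(tmp, "example.sdrf.txt").write_text("Extract Name\n")
    manifest = write_manifest(tmp)
    print("\n".join(manifest))
```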

 

(Source)
