
The Sample and Data Relationship Format (SDRF) is a tab-delimited file that encapsulates a succession of processes applied to samples and the multiple states they take on as a result of those processes. With TCGA data, SDRFs allow data users to follow the processing (e.g. normalization) of data files, starting from their rawest form (Level 1) and seeing the result of each process as a data file at a higher Data Level.

SDRFs are part of the MAGE-TAB standard.

SDRF File Content

A Sample and Data Relationship Format (SDRF) file is a tab-delimited file that describes the relationships between samples, arrays, data, and other objects used or produced in the experiment. An SDRF contains one or more column headers for the following main types of metadata:

| Column header type | Description | Examples |
|---|---|---|
| Name | Name of the sources and/or samples used in the array. There can be multiple columns of names. | Scan Name; Hybridization Name |
| Protocol REF | Provides ID(s) for one or more protocols used in the array and referenced in a corresponding IDF file or MAGE document files | --- |
| File | One or more columns that list files produced in the investigation | Array Data File; Derived Array Data File |
| Attribute | Values, comments, or characteristics relating to and modifying one of the above kinds of columns | Date; Provider; Performer; Label; Factor Values |
| Comment | A MAGE-TAB compliant attribute used by the DCC to include TCGA archive and file information related to samples directly in the SDRF | --- |

The Extract Name column of an SDRF almost always contains the BCR-issued aliquot barcode for TCGA samples; it can also contain other types of IDs for non-TCGA material, such as control samples. These barcodes map assayed samples to platforms, experiments, and results files (e.g. probe signal, gene expression files or copy number files).

The Hybridization Name column of an SDRF contains the ID that is referenced in the column header for Level 2 and 3 data matrices.

Column headers can be used as many times as necessary to adequately describe the use and interaction of materials in the experiment. For more information, see pages 34 and 35 of the MAGE-TAB specification document.
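
The column header types above can be told apart mechanically from the header text. The following is a minimal sketch, not the DCC's implementation: the function name `classify_header` is hypothetical, and the suffix rules ("ends with Name", "ends with File") are a simplifying assumption based on the header types described on this page.

```python
import re

def classify_header(header: str) -> str:
    """Return the broad SDRF column type for one column header.

    Assumption: anything that is not a Name, Protocol REF, File, or
    Comment column is treated as an Attribute column.
    """
    h = header.strip()
    if h == "Protocol REF":
        return "Protocol REF"
    if re.match(r"^Comment\s*\[.*\]$", h):
        return "Comment"
    if h.endswith("File"):
        return "File"
    if h.endswith("Name"):
        return "Name"
    return "Attribute"

headers = [
    "Extract Name",
    "Protocol REF",
    "Labeled Extract Name",
    "Label",
    "Array Data File",
    "Comment [TCGA Archive Name]",
]
for h in headers:
    print(f"{h}: {classify_header(h)}")
```

Because headers may repeat, a classifier like this is applied per column position rather than per unique header name.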

Visualizing SDRF as a DAG

An SDRF is a text-based representation of a directed acyclic graph (DAG). A DAG illustrates what protocols (listed in Protocol REF columns) were used to process a set of samples that produced the resulting experimental results. It shows the relationships between samples, arrays, data, and other objects used or produced in the investigation, and provides all Minimal Information About a Microarray Experiment (MIAME) that is not provided elsewhere. The SDRF file is often the most important part of the experiment description, since it provides a machine-readable way to record and recover the complex relationships that are possible between samples and their respective hybridizations. Constructing simple experiment designs is straightforward, but even complex experimental designs can be expressed in an SDRF.

The SDRF file describes an investigation design graph (IDG), which is a DAG representing how the experiment was carried out. The investigation design graph is a general concept that can be used to represent the workflow of any investigation, and is not restricted to microarray investigations. The SDRF ultimately relates the aliquot ID (Extract Name) to the result files that are produced. The SDRF file consists of a table in which each row corresponds to a path in the graph from one of the source nodes to one of the 'sink' nodes, and the columns represent the steps of the experiment and the data files that result from applying different protocols. The ordering of these columns is important, and should read left-to-right in chronological order. More information on creating SDRF documents is available in the SDRF Help notes on the Mage Tabulator web page.
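
To illustrate the row-as-path idea, here is a minimal sketch that reads a tab-delimited SDRF fragment and turns each row into an ordered list of nodes with the protocol that leads into each one. The function name `row_to_path` is hypothetical, the fragment is an abbreviated version of a TCGA SNP 6 row, and treating every column ending in "Name" as a node is a simplifying assumption.

```python
import csv
import io

# Abbreviated, illustrative SDRF fragment (tab-delimited).
SDRF_TEXT = (
    "Extract Name\tProtocol REF\tLabeled Extract Name\tLabel\t"
    "Protocol REF\tHybridization Name\n"
    "TCGA-07-0227-20A-01D-1427-01\t"
    "broad.mit.edu:labeling:Genome_Wide_SNP_6:01\t"
    "TCGA-07-0227-20A-01D-1427-01\tbiotin\t"
    "broad.mit.edu:hybridization:Genome_Wide_SNP_6:01\t"
    "DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450\n"
)

def row_to_path(headers, row):
    """Return [(node_type, node_id, incoming_protocol), ...] for one row.

    Columns ending in "Name" are nodes; a "Protocol REF" column is the
    edge into the next node to its right. Attribute columns are skipped.
    """
    path = []
    pending_protocol = None
    for header, value in zip(headers, row):
        header, value = header.strip(), value.strip()
        if header == "Protocol REF":
            pending_protocol = value
        elif header.endswith("Name"):
            path.append((header, value, pending_protocol))
            pending_protocol = None
    return path

reader = csv.reader(io.StringIO(SDRF_TEXT), delimiter="\t")
headers = next(reader)
paths = [row_to_path(headers, row) for row in reader]

for node_type, node_id, protocol in paths[0]:
    print(f"--[{protocol}]--> {node_type}: {node_id}")
```

Because column order encodes chronology, the sketch walks columns strictly left to right and never reorders them.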

Required Columns

There are several TCGA-specific SDRF columns per Data File column that are required to guarantee that archives are correctly processed:

| Column name | Description | Example |
|---|---|---|
| Comment [TCGA Archive Name] | Name of the archive a file is contained in | broad.mit.edu_GBM.Genome_Wide_SNP_6.Level_1.1.0.0 |
| Comment [TCGA Data Type] | Data type of the file | Expression-Gene |
| Comment [TCGA Data Level] | Data Level of the file | Level 2 |
| Comment [TCGA Include for Analysis] | A binary value ('yes' or 'no') marking if the file quality is fit for analysis | yes |
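
As a rough illustration of how the required columns could be checked, the sketch below scans a header row and reports, for each column whose header ends in "File", which of the four required Comment columns do not follow it. The function name `missing_comments` is hypothetical, and attributing a Comment column to the nearest File column on its left is an assumption about layout, not a statement of how the DCC validator works.

```python
import re

REQUIRED_COMMENTS = (
    "Comment [TCGA Archive Name]",
    "Comment [TCGA Data Type]",
    "Comment [TCGA Data Level]",
    "Comment [TCGA Include for Analysis]",
)

def missing_comments(headers):
    """Map File-column index -> sorted list of required Comment headers it lacks."""
    missing = {}
    current = None  # index of the File column that Comment columns attach to
    for i, raw in enumerate(headers):
        # Normalize "Comment[X]" and "Comment [X]" to the same spelling.
        h = re.sub(r"\s*\[", " [", raw.strip())
        if h.endswith("File"):
            current = i
            missing[i] = set(REQUIRED_COMMENTS)
        elif h.startswith("Comment") and current is not None:
            missing[current].discard(h)
        else:
            current = None  # any other column ends the File column's block
    return {i: sorted(m) for i, m in missing.items() if m}

headers = [
    "Scan Name", "Array Data File",
    "Comment [TCGA Archive Name]", "Comment [TCGA Data Type]",
    "Comment [TCGA Data Level]", "Comment [TCGA Include for Analysis]",
    "Comment [md5]",
    "Protocol REF", "Normalization Name", "Derived Array Data Matrix File",
    "Comment [TCGA Data Type]", "Comment [TCGA Data Level]",
]
print(missing_comments(headers))
```

In this example the first File column is complete, while the second lacks the archive name and include-for-analysis columns.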

File columns

"*File" refers to any column header that ends with the word "File". Values under this column refer to an existing data file. Raw and processed data files can be ASCII or binary files, typically in their native formats. Alternatively, data may also be provided in the specially-defined tab-delimited format MAGE-TAB Data Matrix. MAGE-TAB data matrices (i.e. an Array Data Matrix File or a Derived Array Data Matrix File) have specified formats described in the MAGE-TAB specification. Array Data Files and Derived Array Data Files denote both raw and processed files for a particular sample.

The Array Design File column is used as a pointer to an Array Design Format (ADF) file included in a transferred archive. Relatedly, the Array Design REF column is used to reference an array design available via the Web. For example:

Affymetrix.com:PhysicalArrayDesign:HG-U133_Plus_2

See the page on ADFs for more information.

Array Design REF and Array Design File

Array Design REF: The "Array Design REF" in the SDRF file is used as an ID for an Array Design deposited in online databases. All TCGA Array Designs will eventually be available in caArray. Those array designs will be referenced using the "Array Design REF" provided in the SDRF. Although "Array Design REF" names are self-assigned, you should follow the prescribed scheme below, and those names should persist unless the Array Design changes. The format of that name should be:

VendorDomain:PhysicalArrayDesign:PlatformCode

Example:
Affymetrix.com:PhysicalArrayDesign:HG-U133_Plus_2
or
Agilent.com:PhysicalArrayDesign:HG-CGH-244K_A

where:

  • VendorDomain should match your vendor's Internet domain name (e.g. Agilent.com).
  • PhysicalArrayDesign is an MGED Ontology term and should be entered verbatim.
  • PlatformCode, when possible, should match your vendor's code (e.g. HG-U133_Plus_2) for the platform you are using. In instances where your vendor does not provide that type of code, terse yet descriptive abbreviated words can be used. Chip sets that are different in design (e.g. Affymetrix 500K Set) should be named differently. In cases where the code is too generic (e.g. 550K), the code should be concatenated with the vendor's abbreviated name (e.g. Illum550K). More information on different platforms and associated codes can be found in the Platform topic.
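
The naming scheme above can be sketched as a small builder and parser. The helper names `make_array_design_ref` and `parse_array_design_ref` are hypothetical, and the allowed character sets in the pattern (letters, digits, hyphen, underscore, plus dots in the vendor domain) are an assumption rather than part of the scheme as stated.

```python
import re

# Assumed character sets; the literal "PhysicalArrayDesign" is required verbatim.
ARRAY_DESIGN_REF = re.compile(
    r"^([A-Za-z0-9\-_.]+):PhysicalArrayDesign:([A-Za-z0-9\-_]+)$"
)

def make_array_design_ref(vendor_domain: str, platform_code: str) -> str:
    """Build VendorDomain:PhysicalArrayDesign:PlatformCode and sanity-check it."""
    ref = f"{vendor_domain}:PhysicalArrayDesign:{platform_code}"
    if not ARRAY_DESIGN_REF.match(ref):
        raise ValueError(f"not a well-formed Array Design REF: {ref!r}")
    return ref

def parse_array_design_ref(ref: str):
    """Split an Array Design REF into (vendor_domain, platform_code)."""
    match = ARRAY_DESIGN_REF.match(ref)
    if match is None:
        raise ValueError(f"not a well-formed Array Design REF: {ref!r}")
    return match.group(1), match.group(2)

print(make_array_design_ref("Affymetrix.com", "HG-U133_Plus_2"))
print(parse_array_design_ref("Agilent.com:PhysicalArrayDesign:HG-CGH-244K_A"))
```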

Array Design File: The "Array Design File" is used as a pointer to an Array Design Format (ADF) file included in a transferred archive. ADF files should be provided for non-standard array designs (e.g. Illumina) or complex designs where additional metadata can be provided (e.g. Affymetrix Exon 1ST). In those cases, please transfer a standardized ADF file in your first archive only. In all subsequent archives, place a blank dummy file with the same ADF file name in your archive as a placeholder. That will ensure that your archive is valid. The DCC will ensure that your archives are linked to the original ADF file. If you do include an ADF file, you should list the ADF file name in the "Array Design File" column and also list your "Array Design REF" name.

Column Relationships through SDRF Example

Following is a segment of a TCGA SDRF file and the corresponding investigation design graph it represents. The graph shows sample values (enclosed within <>) from the first row of the SDRF. The SDRF file refers to terms defined in the attached IDF file. A few points to note are:

  1. Named columns (e.g., Labeled Extract Name) are nodes in the DAG, and Protocol REFs are edges.
  2. Nodes could correspond to biomaterials or data objects and have metadata associated with them. Examples of such metadata are data files (e.g., Array Data File) and annotations (e.g., Label, Array Design REF)
  3. Files can have metadata too (e.g., Comment[TCGA Archive Name])
  4. Each node and edge column may be associated with one or more attribute columns containing annotation. In each case the attribute column follows immediately after the respective node or edge column. For example,
  5. Where ontology terms are used, a "Term Source REF" field should follow immediately to the right of the column containing the actual ontology terms. In the example shown below, <Affymetrix.com:PhysicalArrayDesign:Genome_Wide_SNP_6> (Array Design REF) is associated with <caArray> (Term Source REF), which in turn is defined in the IDF.
Extract Name	Protocol REF	Labeled Extract Name	Label	Term Source REF	Protocol REF	Hybridization Name	Array Design REF	Term Source REF	Protocol REF	Scan Name	Array Data File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data Matrix File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]	Protocol REF	Normalization Name	Derived Array Data File	Comment [TCGA Archive Name]	Comment [TCGA Data Type]	Comment [TCGA Data Level]	Comment [TCGA Include for Analysis]	Comment [md5]
TCGA-07-0227-20A-01D-1427-01	broad.mit.edu:labeling:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	biotin	MGED Ontology	broad.mit.edu:hybridization:Genome_Wide_SNP_6:01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450	Affymetrix.com:PhysicalArrayDesign:Genome_Wide_SNP_6	caArray	broad.mit.edu:image_acquisition:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.CEL	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_1.45.1005.0	Copy Number Results-SNP	Level 1	yes	212d2797383a4b644913aa4023e468cc	broad.mit.edu:invariantset_medianpolish:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.ismpolish.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	e7bae3547949605e463e7c86f47a44ab	broad.mit.edu:birdseed_genotype:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.birdseed.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	dbcb4dffab2d0783bca5d4aa019c6436	broad.mit.edu:copy_number:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.raw.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	a3aff7a9863a1ae26c601de3ac162752	broad.mit.edu:copynumber_byallele:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.byallele.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	771e188a21b2c04b373a03aa68fd34b3	broad.mit.edu:no_outlier_copy_number:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.no_outlier.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	7d5c485109c076c45df83644db7af0bd	broad.mit.edu:after_5NN_copy_number:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.after_5NN.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.45.1005.0	Copy Number Results-SNP	Level 2	yes	0b78b4a9ab5cdc50278cfa13b8db2b25	broad.mit.edu:segmented_cna:Genome_Wide_SNP_6:01	TCGA-07-0227-20A-01D-1427-01	DEBUT_p_TCGAb45_81_wRedosSNP_N_GenomeWideSNP_6_D04_729450.seg.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_3.45.1005.0	Copy Number Results-SNP	Level 3	yes	3246596ffa99597e69aa28f983e4d185
TCGA-A6-2670-01A-02D-0819-01	broad.mit.edu:labeling:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	biotin	MGED Ontology	broad.mit.edu:hybridization:Genome_Wide_SNP_6:01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014	Affymetrix.com:PhysicalArrayDesign:Genome_Wide_SNP_6	caArray	broad.mit.edu:image_acquisition:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.CEL	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_1.28.1006.0	Copy Number Results-SNP	Level 1	yes	dce608413f34c7d44922c539c430472a	broad.mit.edu:invariantset_medianpolish:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.ismpolish.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	7e1c1ab2e41ccc8847f2822cf874f0a7	broad.mit.edu:birdseed_genotype:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.birdseed.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	967216c8b3fae5f0f85a51ff75306c05	broad.mit.edu:copy_number:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.raw.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	ca06d651a50d1f3c25695de6715d9084	broad.mit.edu:copynumber_byallele:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.byallele.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	0035fbcc0dd667d9626e05ef23eaeea4	broad.mit.edu:no_outlier_copy_number:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.no_outlier.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	8d676bd7de8f6122c743ae83c54c9cbf	broad.mit.edu:after_5NN_copy_number:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.after_5NN.copynumber.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_2.28.1006.0	Copy Number Results-SNP	Level 2	yes	130b749160a4a28103ea5ffa5caa6549	broad.mit.edu:segmented_cna:Genome_Wide_SNP_6:01	TCGA-A6-2670-01A-02D-0819-01	VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_G01_569014.seg.data.txt	broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_3.28.1006.0	Copy Number Results-SNP	Level 3	yes	3119829eecc39f914753c036220b3233

[Figure: investigation design graph for the sample SDRF (sampleSDRF_SNP)]

To see a pictorial representation of your experimental design (an investigation design graph) as a result of the MAGE-TAB you create, download and install the latest version of the EBI's Tab2MAGE software package, and then run the expt_check.pl script. Details about the Tab2MAGE package and the script are provided under tab2mage on the Mage Tabulator Home Page.

Guidelines on Creating SDRF Files

  1. Where data is not available, please use "->" as a placeholder to indicate that the field is empty.
  2. The order of columns is important in SDRF files. SDRF files represent an ordered graph where the nodes are "*Name" columns (i.e. any column header that ends with the word "Name", such as Extract Name) and the edges describe the steps taken between the nodes (e.g. performing a protocol). The "->" placeholders imply direction.
  3. Unlike IDF, in the SDRF file, "Term Source REF" is not the suffix of another column header. For example, for the column header "Label" "biotin" is a controlled vocabulary term. The "Term Source REF" that directly follows "Label" (i.e. MGED Ontology) identifies the source of that term.
  4. If you are starting with the Extract you received from the BCR, you only need to enter the BCR aliquot barcode, including the BCR plate barcode, in the "Extract Name" field. All other columns before "Extract Name" (i.e. "Source Name," "Sample Name," and all source and sample characterization data from the BCR) will be merged with your SDRF data by the DCC. Internal Controls and non-BCR analytes are an exception to this.
  5. Use "Parameter Value [Amplification]" to enter whether your BCR extract was amplified. That is, if the row for an aliquot barcode (in Extract Name) contains a W or G as part of the analyte code, then enter "yes" as the value in the SDRF "Parameter Value [Amplification]" column for that aliquot's row.
  6. There should usually be a one-to-one relationship between the "Scan Name" and the "Hybridization Name" depending on the experimental design. The "Hybridization Name" should match the "Scan Name" because of the way the MAGE-TAB software works, and since in the majority of experimental designs a single scan is conducted on a hybridization. However, that is a general rule; you should use hybridization and scan names that reflect your experimental design.
  7. Depending on the experimental/investigation design, "Normalization Name" should usually be the same as "Hybridization Name" and "Scan Name." Also the last column, "Derived Array Data Matrix File," and its variations are usually directly preceded by "Normalization Name", the file being the result of a normalization procedure. However, this is only a general rule; the names you use should reflect your experimental design.
  8. Data File columns list files with specific formats. Array Data Files and Derived Array Data Files are in native format, while Array Data Matrix Files and Derived Array Data Matrix Files are in MAGE-TAB Data Matrix format.
  9. If a file is not in MAGE-TAB Data Matrix format, then it should not be listed under a Data Matrix column. Doing so will cause validation to fail.
  10. The files listed in Data File columns are associated with a particular data type and data level. In general, Array Data Files and Array Data Matrix Files are considered Level 1 data. All other level files should be listed under Derived Array Data File or Derived Array Data Matrix File depending on the format of the file.
  11. Three TCGA Comment columns must follow each File column: "Comment [TCGA Data Type]", "Comment [TCGA Data Level]", and "Comment [TCGA Include for Analysis]".
  12. "Comment [TCGA Include for Analysis]" is binary (yes or no) and provides a method to indicate that a particular result is not adequate for analysis.
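
Guideline 5 can be sketched in code. The example below is illustrative only: the function name `amplification_value` is hypothetical, it assumes the analyte code is the final letter of the fifth hyphen-separated field of the aliquot barcode (e.g. the "D" in "01D"), and returning "no" for every other analyte code is an assumption, since the guideline only states when to enter "yes".

```python
def amplification_value(aliquot_barcode: str) -> str:
    """Return the value for "Parameter Value [Amplification]" for one aliquot.

    Assumed barcode layout: TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center,
    so the analyte code is the last character of the fifth field.
    """
    fields = aliquot_barcode.strip().split("-")
    if len(fields) < 5 or len(fields[4]) != 3:
        raise ValueError(f"unexpected aliquot barcode layout: {aliquot_barcode!r}")
    analyte_code = fields[4][-1].upper()
    # W and G analyte codes mark amplified extracts (guideline 5).
    return "yes" if analyte_code in {"W", "G"} else "no"

# Barcode taken from an SDRF row with a "D" analyte code.
print(amplification_value("TCGA-07-0227-20A-01D-1427-01"))
# Hypothetical barcode with a "W" analyte code, for illustration only.
print(amplification_value("TCGA-07-0227-20A-01W-1427-01"))
```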

Internal Controls and Non-BCR Analytes

The DCC merges BCR biospecimen characterization data with center experimental assay results before the results are published. However, the DCC does not have the characterization data for internal controls or non-BCR analytes that centers may be using for quality control. Therefore, centers must provide that data using the MAGE-TAB specification.

For non-BCR analytes, data must be provided for all columns preceding and including "Extract Name." Those columns provide a minimum amount of characterization. If other characterization can be provided, refer to the MAGE-TAB specification for which columns to add. A MAGE-TAB placeholder (->) should be entered for all BCR analyte columns preceding "Extract Name."

File Name Format

SDRF file names are derived from the name of the archive where they are housed, as follows:

<Domain>_<Disease Study>.<Platform>.<Archive Serial Index>.sdrf.txt

Note that the archive revision and series numbers are not included. See Archive Naming Conventions.

The following is an example filename for an SDRF:

broad.mit.edu_GBM.HT_HG-U133A.1.sdrf.txt
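
A minimal sketch of deriving the SDRF file name from an archive name follows. The function name `sdrf_file_name` is hypothetical, and the archive name layout `<Domain>_<Disease Study>.<Platform>.<Archive Type>.<Serial Index>.<Revision>.<Series>` is an assumption inferred from the archive names that appear on this page; consult Archive Naming Conventions for the authoritative layout.

```python
def sdrf_file_name(archive_name: str) -> str:
    """Derive <Domain>_<Disease Study>.<Platform>.<Serial Index>.sdrf.txt."""
    # The domain itself contains dots, so split it off at the first underscore.
    domain, rest = archive_name.split("_", 1)
    disease, rest = rest.split(".", 1)
    # The last four dot-separated fields are assumed to be
    # archive type, serial index, revision, and series.
    platform, _archive_type, serial, _revision, _series = rest.rsplit(".", 4)
    return f"{domain}_{disease}.{platform}.{serial}.sdrf.txt"

print(sdrf_file_name("broad.mit.edu_GBM.Genome_Wide_SNP_6.Level_1.1.0.0"))
```

Splitting from the right keeps the sketch working when the platform code itself contains underscores or hyphens.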

SDRF File Validation

Purpose

The validator checks the SDRF file in the MAGE-TAB archive to ensure that required elements and values are present, and to check certain platform-specific or data type-specific rules.

Runs On

MAGE-TAB archive SDRF files with the extension '.sdrf.txt'.

Actions

For GSC MAGE-TAB archives, see the GSC MAGE-TAB specification.

For GCC MAGE-TAB archives:

  • If any column headers are not in the 'allowed headers' list for the corresponding experiment type, then FAIL
  • If any row of the file has fewer tab-delimited values than the header row, then FAIL
  • If any required column headers are missing, then FAIL
  • If any columns with headers ending with 'File' (henceforth known as 'File columns') are missing any required Comment columns (see below), then FAIL
  • If any row contains non-blank values after a 'No' for 'Comment [TCGA Include for Analysis]' then FAIL
  • If value for 'Comment [TCGA Data Level]' column is not in format 'Level N' where N is a valid number, then FAIL
  • For protein array experiments:
    • If the File column's 'Comment [TCGA Data Level]' value is 'Level 1' and the File column header is not one of 'Image File', 'Array Data File', or 'Derived Array Data Matrix File' then FAIL
    • If the File column's 'Comment [TCGA Data Level]' value is 'Level 2' and the File column header is not one of 'Derived Array Data File', 'Derived Array Data Matrix File' then FAIL
    • If the File column's 'Comment [TCGA Data Level]' value is 'Level 3' and the File column header is not 'Derived Array Data Matrix File' then FAIL
  • If the value for 'Comment [TCGA Include for Analysis]' column is not 'yes' or 'no' (case ignored) then FAIL
  • If the value for 'Comment [TCGA Archive Name]' column is not a valid archive name, then FAIL
  • If the level of an archive (as parsed from TCGA Archive Name column) does not match the value of the Data Level column (replacing '_' with ' '), then FAIL
  • For DNA Array experiments:
    • If the values for 'Array Design REF' columns don't match pattern
      ([a-zA-Z0-9\-_.]+)[:]+([a-zA-Z0-9\-_]+)[:]+([a-zA-Z0-9-_]+)
      then FAIL
  • For protein array experiments:
    • If the 'Sample Name' column is missing or repeats more than once, then FAIL
    • If the 'Comment [TCGA Biospecimen Type]' is 'Shipped Portion' then the value of 'Sample Name' for that row must be a valid shipped portion UUID, otherwise FAIL
  • For RNASeq experiments:
    • Each unique extract name value must be linked to a gene file, an exon file, and a splice junction file, otherwise FAIL
    • Each unique extract name value may be linked to a coverage (wig) file, otherwise warn
    • If the value for 'Extract Name' is not used as a substring in 'Assay Name', then FAIL
  • For miRNASeq experiments:
    • Each unique extract name value must be linked to an isoform.quantification file and a mirna.quantification file, otherwise FAIL
    • If the value for 'Extract Name' is not used as a substring in 'Assay Name', then FAIL
  • If any value for 'Term Source REF' is not represented in the IDF file's Term Source Name row, then FAIL
  • For RNASeq, and miRNASeq experiments, if the Extract Name value is not a valid aliquot barcode or a valid UUID, then FAIL.
  • For DNA Array, if the Extract Name is not a valid barcode or UUID and the Source Name and Sample Name values are non-blank, then the row is accepted as a non-TCGA control. Otherwise, if the Extract Name value is not a valid aliquot barcode, or valid UUID, then FAIL.
  • For DNA Array, RNASeq, and miRNASeq experiments, if the Extract Name value biospecimen does not belong to the disease set for the archive's disease type, then FAIL
  • If a biospecimen is referenced as being in more than one archive of the same type but with different serial indices, then warn
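
A few of the rules above can be sketched as follows. This is not the DCC validator: the function name `check_sdrf` is hypothetical, only four rules are covered (row length, Data Level format, Include for Analysis values, and archive level versus Data Level), "->" placeholders are not handled, and the sketch assumes that each "Comment [TCGA Archive Name]" column precedes the "Comment [TCGA Data Level]" column for the same file.

```python
import re

LEVEL_RE = re.compile(r"^Level (\d+)$")
ARCHIVE_LEVEL_RE = re.compile(r"\.(Level_\d+)\.")

def check_sdrf(headers, rows):
    """Return a list of failure messages; an empty list means no failures found."""
    failures = []
    for n, row in enumerate(rows, start=1):
        if len(row) < len(headers):
            failures.append(f"row {n}: fewer values than the header row")
            continue
        archive = None  # most recent archive name seen in this row
        for header, value in zip(headers, row):
            header, value = header.strip(), value.strip()
            if header == "Comment [TCGA Archive Name]":
                archive = value
            elif header == "Comment [TCGA Data Level]":
                if not LEVEL_RE.match(value):
                    failures.append(f"row {n}: bad Data Level {value!r}")
                elif archive is not None:
                    found = ARCHIVE_LEVEL_RE.search(archive)
                    # Compare e.g. "Level_1" in the archive name with "Level 1".
                    if found and found.group(1).replace("_", " ") != value:
                        failures.append(f"row {n}: archive level does not match {value!r}")
            elif header == "Comment [TCGA Include for Analysis]":
                if value.lower() not in ("yes", "no"):
                    failures.append(f"row {n}: bad Include for Analysis {value!r}")
    return failures

HEADERS = [
    "Extract Name", "Array Data File", "Comment [TCGA Archive Name]",
    "Comment [TCGA Data Type]", "Comment [TCGA Data Level]",
    "Comment [TCGA Include for Analysis]",
]
GOOD_ROW = [
    "TCGA-07-0227-20A-01D-1427-01", "example.CEL",
    "broad.mit.edu_COAD.Genome_Wide_SNP_6.Level_1.45.1005.0",
    "Copy Number Results-SNP", "Level 1", "yes",
]
print(check_sdrf(HEADERS, [GOOD_ROW]))
```

Collecting messages instead of stopping at the first failure lets a submitter fix several problems in one pass.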

Allowed Headers

Index

MAGE-TAB Node/Edge

Platform Type

Graph Type

Description

Value Type

Example(s)

Associated nodes/attributes

1

Source Name

all

node

Identifies the origin of a biospecimen or control. Since BCRs usually provide biospecimens at the sample or extract level, Source is required when there are non-TCGA provided biospecimens or controls--they need to be described since there is no record from the BCR on them.

ID

Patient ID, Biospecimen ID, non-TCGA biospecimen/control ID or name

Characteristics, Provider, Material Type, Description, Comment

2

Sample Name

all

node

Identifies derivatives of a biospecimen or control Source as the result of protocol edges between Sample Name and previous nodes. Sample can refer to a sample of tissue. Sample is required for protein arrays and when there are non-BCR provided biospecimens or controls--they need to be described since there is no record from the BCR on them. BCR-provided biospecimens are Shipped Portions in Sample Name and are the primary key with respect to TCGA biospecimen IDs.

ID

Sample ID, Biospecimen ID

Characteristics, Material Type, Description, Comment

3

Extract Name

all

node

Identifies the result of deriving a molecular analyte (e.g. DNA, RNA, or protein) using a protocol. Many protocols may be used in a series to derive the final extract that is used for labeling, hybridization, and/or experimental assay. Extract Name is a primary key column for aliquot IDs in TCGA.

ID

TCGA Barcode, UUID or some other extract, such as a reference standard used in an experiment

Characteristics, Material Type, Description, Comment, Comment [TCGA Barcode]

4

Comment [TCGA Barcode]

All but Protein Array RPPA

attribute

TCGA aliquot metadata corresponding to the aliquot UUID in Extract Name

TCGA aliquot Barcode

TCGA-BJ-A0ZB-01A-11D-A10R

5

Labeled Extract Name

hybridization-based platforms, Protein Arrays

node

The Labeled Extracts in an experiment are those materials which have been conjugated to a label of some kind, prior to hybridization on an array. Typically there is only one Labeled Extract step. This column contains user-defined names for each Labeled Extract material.

ID

ID

Characteristics, Material Type, Description, Label, Comment

6

Hybridization Name

hybridization-based platforms, Protein Arrays

node

This column contains user-defined names for each Hybridization.

ID

uA Chip serial, Hyb ID

Array Design File / REF, Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Comment

7

Scan Name

all

node

If desired, the act of scanning the hybridized array may be represented as a distinct node in the experimental graph, and encoded in the SDRF using Scan Name columns. These columns are optional, but can be useful in cases where multiple scans have been made of a single hybridized array, but where the data files do not explicitly reflect this.

ID

Usually the same as Hybridization Name

Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Comment

8

Normalization Name

all

node

Represents the act or result of normalizing your data independently from the listing of data files themselves

ID

Usually the same as Hybridization Name

Derived Array Data File, Derived Array Data Matrix File, Comment

9

Array Data File

hybridization-based platforms

node

Identifies a Level 1 (raw) data file that is the result of scanning an array platform object. Files of this object type are in a platform-specific format, sometimes binary.

CEL, GPR, Raw TXT data files

Filename for Raw Data file

Comment

10

Derived Array Data File

hybridization-based platforms, Protein Arrays

node

Identifies a data file that is derived from a lower level array-based data file. Files of this object type are in a format specific to a platform or analytical method (e.g. Circular Binary Segmentation (CBS)).

CHP

Filename for Normalized Data files

Comment

11

Array Data Matrix File

hybridization-based platforms

node

Identifies a Level 1 (raw) data file that is the result of scanning an array platform object. Files of this object type are in MAGE-TAB data matrix format.

Raw values for all genes vs hybridizations

Filename for Data Matrix

Comment

12

Derived Array Data Matrix File

hybridization-based platforms, Protein Arrays

node

Identifies a data file that is derived from a lower level array-based data file. Files of this object type are in MAGE-TAB data matrix format or a slightly modified version of it (e.g. adding a chr coords column).

Normalized values for all genes vs hybridizations

Filename for Normalized Data Matrix

Comment

13

Image File

all

node

Identifies an image data file. Image data files may be associated with any result object. Historically, image files are associated with the result of a scan.

Filename for some sort of image file

TIFF, BMP, JPG, etc

Comment

14

Array Design File / REF

hybridization-based platforms, Protein Arrays

node

An array design describes the position of probes on an array and usually the associations between those probes and the genome, genes, or other composite elements. An Array Design File should be an actual file that is included with a submission. An Array Design REF is a reference to an array design that can be either a URI/ID that indicates the database it is contained in, or a URL pointing to the array design.

Reference to a caArray Array AD ID or Filename of an ADF

Term Source REF, Comment

15

Assay Name

RNASeq, miRNASeq

node

Identifies the result of performing a molecular assay. A value in this column may be referred to in other MAGE-TAB documents using REF (e.g. in a Data Matrix). Used as an identifier within the MAGE-TAB document. This column contains user-defined names for each Assay. Assay Name may be used instead of Hybridization Name to identify generic biological assays, such as rtPCR. Note that this column should not be used for submission of regular microarray experiments to ArrayExpress.

ID

Technology Type, Comment[]

16

Annotation REF

RNASeq, miRNASeq

node

Refers to a file containing genomic annotations (e.g. GAF). REF indicates that the file is not in the submitted experiment archives.

URI

https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf

Comment[]

17

Data Transformation Name

RNASeq, miRNASeq

node

Identifies the result of performing a transformation (instead of a normalization; e.g. alignment) of data in a file from a previous node. A value in this column may be referred to in other MAGE-TAB documents using REF (e.g. in a Data Matrix).

ID

TCGA-BH-A0W3-01A-11R-A109-07 consensus_mRNA

Comment[]

18

Derived Data File REF

RNASeq, miRNASeq

node

Refers to a file containing data derived from the protocol listed in the preceding column. REF indicates that the file is not in the submitted experiment archives.

URI or DB specific ID

BAM file name or URI or UUID

Comment[]

19

Array Name

Protein Arrays

node

Identifies the result of the protocol for constructing an array.

ID

ID/name describing the event of printing an array (e.g. AKT_pS473(V)_GBL9010352)

Array Design File, Comment[]

20

Annotations File

Protein Arrays

node

Refers to a file containing genomic annotations (e.g. GAF) or reporter annotations (e.g. antibodies). The file suffix indicates that the file is included in the experiment archives; annotations are always in the MAGE-TAB archive since they are common files.

name of the file that describes the annotations for antibodies (or any other kind of annotation file for other platforms)

mdanderson.org_OV.antibody_annotations.txt

Comment [TCGA Data Type], Comment [TCGA Data Level], Comment [TCGA File Type], Comment[]

21

Protocol REF

all

edge

An ID for a protocol described in an IDF file

Reference to a MAGE-TAB Protocol Name

A MAGE-TAB Protocol ID

Term Source REF, Parameter, Performer, Date, Comment

22

Characteristics [ ]

all

attribute

Describes a characteristic of a biological object (source, sample, extract). The specific characteristic is listed between the square brackets.

Ontological or Controlled Vocabulary

Unit, Term Source REF

23

Provider

all

attribute

Describes the provider of a biological object (source, sample, extract).

The provider of a Source; and address

Comment

24

Material Type

all

attribute

Controlled terms for the state of the BioMaterial. Each state (BioSource, different BioSamples, and LabeledExtract) has a MaterialType. Examples are population of an organism, organism, organism part, cell, etc.

Ontological or Controlled Vocabulary

DNA, RNA, Tissue, Tumor, etc

Term Source REF

25

Label

all

attribute

Controlled vocabulary term. Used as an attribute column following Labeled Extract Name. The label compound which is conjugated to an Extract to create the Labeled Extract. For ArrayExpress submissions this term should be an instance of LabelCompound from the MGED Ontology. Examples: Cy3, Cy5, biotin, alexa_546. The following columns can be used to annotate Label columns:

Ontological or Controlled Vocabulary

Cy3

Term Source REF

26

Factor Value [Experimental Factor name] ( )

all

attribute

The Factor Values for an experiment are the values of the variables under investigation. For example, an experiment studying the effect of different compounds on a cell culture would have compound as an experimental variable. These variables are listed in the IDF as Experimental Factor Names with associated Types.

Ontological or Controlled Vocabulary

Factor Value [Drug]: lapatinib

Unit, Term Source REF

27

Performer

all

attribute

Used as an attribute column following Protocol REF. The name of the researcher who carried out the protocol.

Person

Comment

28

Date

all

attribute

Used as an attribute column following Protocol REF. The date (and time, where available) upon which the protocol was performed, in the following format: YYYY-MM-DD

Date

29

Parameter Value [Reference to Protocol Parameter]

all

attribute

Used as an attribute column following Protocol REF columns. This column contains values for the protocol parameters referenced in the column header.

Ontological or Controlled Vocabulary

For example, if a Protocol Name Array Hybridization is defined in the accompanying IDF, with Protocol Parameters hyb temp;hyb volume, the following would be valid: Protocol REF, Parameter Value [hyb temp], Unit [TemperatureUnit], Parameter Value [hyb volume], Unit [VolumeUnit]

Unit, Comment, Term Source REF

30

Unit [ ]

all

attribute

Controlled vocabulary term. Used as an attribute column following Characteristics[], Factor Value[] or Parameter Value[]. This column contains terms describing the unit(s) to be applied to the values in the preceding column. The type of unit is included in the column heading, e.g. Unit[TimeUnit]. These unit types should correspond to Unit subclasses from the MGED Ontology.

Ontological or Controlled Vocabulary

Term Source REF

31

Description

all

attribute

Used as an attribute column following Source Name, Sample Name, Extract Name, or Labeled Extract Name. A free-text description to be attached to the corresponding material. To be used sparingly, if at all - most annotations should be provided using controlled vocabulary terms, using Characteristics[] columns.

Text

32

Term Source REF

all

attribute

Used as an attribute column following any controlled vocabulary column (e.g., Characteristics[]) or any column allowing reference to external entities (e.g., Protocol REF). This column contains references to ontology or database Term Sources defined in the IDF, from which the values in the previous column were taken.

Reference to an Ontological or Controlled Vocabulary Source

33

Comment [ ]

all

attribute

This column can be used to annotate the main graph node and edge columns listed above. It is included as an extensibility mechanism, and should not generally be used to encode meaningful biological annotation. The column heading should contain a name for the type of values included in the column.

Ontological or Controlled Vocabulary

34

Technology Type

all

attribute

Used as an attribute column following Assay Name. This column contains terms describing the type of each generic (non-hybridization) assay. The Term Source REF column in this case would point to the ontology (defined in the IDF) from which the Technology Type terms are taken.

Ontological or Controlled vocabulary term

rtPCR

Term Source REF

35

Comment [NCBI SRA Experiment Accession]

RNASeq, miRNASeq

attribute

Allows a center to associate results with a SRA Experiment

NCBI SRA Experiment Accession

SRX010729

none

36

Comment [Genome reference]

all

attribute

Allows a center to associate results with a specific genome build that was used as the basis for analysis

UCSC or NCBI Genome Build ID

HG18

none

37

Comment [NCBI dbGAP Experiment Accession]

RNASeq, miRNASeq

attribute

Allows a center to associate results with a dbGaP Experiment

NCBI dbGAP Experiment Accession

phs000178

none

38

Comment [TCGA Include for Analysis]

all

attribute

Provides a method for the center to indicate that a particular result file is not adequate for analysis

Boolean (yes/no)

Yes

none

39

Comment [TCGA Data Type]

all

attribute

Provides a method for the center to indicate the TCGA data type of a result file

TCGA Data Type

Copy Number-SNP

none

40

Comment [TCGA Data Level]

all

attribute

Provides a method for the center to indicate the TCGA data level of a result file

TCGA Data Level

1

none

41

Comment [TCGA Archive Name]

all

attribute

Indicates the archive name that a file is contained in

TCGA Archive Name

mdanderson.org_OV.MDA_RPPA_Core.Level_1.1.0.0

none

42

Comment [TCGA File Type]

Protein Arrays

attribute

Provides a method for the center to indicate the file type of a result file. File types allow users or software to know how to process a file.

TCGA File Type

Array Slide Image (TIFF)

none

43

Comment [TCGA Biospecimen Type]

Protein Arrays

attribute

Used to describe a biospecimen UUID

TCGA Biospecimen Type

shipped portion

none

44

Comment [Antibody Name]

Protein Arrays

attribute

Identifies the antibody used on a RPLA array

Antibody Name

AKT_pS473(V)

Annotations File

45

Comment [TCGA MD5]

Protein Arrays

attribute

MD5 of a file

The md5 checksum hash of the associated file

f61735bc5307feec994f99c718f0223e

Any File column value listed in a MANIFEST: Array Data File, Derived Array Data File, Array Data Matrix File, Derived Array Data Matrix File, Image File
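
As an illustration of how the node, edge, and attribute object types in the table above might be applied, the following Python sketch splits an SDRF column header into its base name and bracketed qualifier and looks up the object type. The function name classify_header and the lookup sets are hypothetical names chosen for this example; the sets cover only rows 10 through 45 of the table and are not taken from any DCC tool.

```python
import re

# Object types copied from rows 10-45 of the table above.
# "Array Design File / REF" is treated as two headers.
NODE_HEADERS = {
    "Array Data File", "Derived Array Data File", "Array Data Matrix File",
    "Derived Array Data Matrix File", "Image File", "Array Design File",
    "Array Design REF", "Assay Name", "Annotation REF",
    "Data Transformation Name", "Derived Data File REF", "Array Name",
    "Annotations File",
}
EDGE_HEADERS = {"Protocol REF"}
ATTRIBUTE_HEADERS = {
    "Characteristics", "Provider", "Material Type", "Label", "Factor Value",
    "Performer", "Date", "Parameter Value", "Unit", "Description",
    "Term Source REF", "Comment", "Technology Type",
}

# Splits "Comment [TCGA Data Level]" into "Comment" and "TCGA Data Level".
HEADER_RE = re.compile(r"^\s*([^\[\]]+?)\s*(?:\[\s*(.*?)\s*\])?\s*$")


def classify_header(header):
    """Return (object_type, base_name, qualifier) for one column header.

    qualifier is None when the header has no square brackets.
    """
    match = HEADER_RE.match(header)
    if match is None:
        return ("unknown", header.strip(), None)
    base, qualifier = match.group(1), match.group(2)
    if base in NODE_HEADERS:
        kind = "node"
    elif base in EDGE_HEADERS:
        kind = "edge"
    elif base in ATTRIBUTE_HEADERS:
        kind = "attribute"
    else:
        kind = "unknown"
    return (kind, base, qualifier)


if __name__ == "__main__":
    for name in ["Derived Array Data File", "Protocol REF",
                 "Comment [TCGA Data Level]", "Unit[TimeUnit]"]:
        print(name, "->", classify_header(name))
```

Headers that the sketch does not recognize are reported as "unknown" rather than rejected, since whether an unlisted column is permitted depends on the platform.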

Required Headers

For RNASeq and miRNASeq experiments: see the TCGA Encyclopedia page for RNASeq.

For protein array experiments: see the Protein Array Data Format Specification.

For DNA array experiments: Extract Name

Required Comment Columns

For DNA array, RNASeq, and miRNASeq experiments:

Comment [TCGA Archive Name]
Comment [TCGA Data Level]
Comment [TCGA Data Type]
Comment [TCGA Include for Analysis]

For Protein array experiments:

Comment [TCGA Data Level]
Comment [TCGA Data Type]
Comment [TCGA File Type]
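
The required comment columns listed above can be checked against the header line of an SDRF with a few lines of Python. This is a minimal sketch: the experiment-type keys ("dna_array_or_seq", "protein_array") and the function name missing_required_comments are hypothetical, and the sketch assumes that whitespace inside a header (for example "Comment[TCGA Data Level]" versus "Comment [TCGA Data Level]") is not significant.

```python
import re

# Required comment columns as listed above, keyed by hypothetical
# experiment-type labels chosen for this sketch.
REQUIRED_COMMENTS = {
    "dna_array_or_seq": [
        "Comment [TCGA Archive Name]",
        "Comment [TCGA Data Level]",
        "Comment [TCGA Data Type]",
        "Comment [TCGA Include for Analysis]",
    ],
    "protein_array": [
        "Comment [TCGA Data Level]",
        "Comment [TCGA Data Type]",
        "Comment [TCGA File Type]",
    ],
}


def _key(header):
    # Assumption: whitespace differences between headers are ignored.
    return re.sub(r"\s+", "", header)


def missing_required_comments(header_line, experiment_type):
    """Return the required comment columns absent from a tab-delimited header line."""
    present = {_key(h) for h in header_line.rstrip("\r\n").split("\t")}
    return [c for c in REQUIRED_COMMENTS[experiment_type]
            if _key(c) not in present]


if __name__ == "__main__":
    header = ("Extract Name\tArray Data File\t"
              "Comment [TCGA Data Level]\tComment [TCGA Data Type]")
    print(missing_required_comments(header, "protein_array"))
```

The sketch only checks that each column is present somewhere in the header; it does not check that the comment columns follow a particular file column.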

Validations

The following table shows the validations that are done on specific columns of the SDRF.

Column Name

Validation Rule

Default

miRNASeq

RNASeq

MDA_RPPA_Core

Comment [TCGA Archive Name]

Column must exist

Applicable

Applicable

Applicable

Applicable

Archive names listed under this column should be available in the latest or uploaded archives

Applicable

Applicable

Applicable

Applicable

Array Data File
Array Data Matrix File
Derived Array Data File
Derived Array Data Matrix File
Derived Data File

File columns should be followed by the following comment columns

  • Comment [TCGA Archive Name]
  • Comment [TCGA Data Level]
  • Comment [TCGA Data Type]
  • Comment [TCGA Include for Analysis]

Applicable

Applicable

Applicable

N/A

Files listed under each file column should exist

Applicable

Applicable

Applicable

Applicable

For each valid file column value (other than '->') the corresponding 'Comment [TCGA Data Level]' column value should be a valid number

Applicable

Applicable

Applicable

N/A

For each valid file column value (other than '->') the 'Comment [TCGA Include for Analysis]' value should be 'yes' or 'no'

Applicable

Applicable

Applicable

N/A

For each valid file column value (other than '->') the 'Comment [TCGA Archive Name]' column value should be a valid archive name

Applicable

Applicable

Applicable

N/A

File column value should be '->' if the corresponding 'Comment [TCGA Include for Analysis]' value is 'no'

N/A

Applicable

Applicable

N/A

The 'Comment [TCGA Data Level]’ value should match the level specified in the Comment [TCGA Archive Name] value

Applicable

Applicable

Applicable

 

Derived Array Data Matrix File

Files listed under this column should contain tab delimited data

Applicable

Applicable

Applicable

Applicable

Files listed under this column should contain same number of elements in each line

Applicable

Applicable

Applicable

Applicable

All Level 1, Level 2, Level 3, and Level 4 files listed in the experiment archives should be referenced in the SDRF file

Applicable

Applicable

Applicable

Applicable

Extract Name
Comment [TCGA Barcode]
Material Type
Protocol REF
Assay Name
Annotation REF
Data Transformation Name
Derived Data File
Derived Data File REF
Comment [NCBI SRA Experiment Accession]
Comment [Genome reference]
Comment [NCBI dbGAP Experiment Accession]
Comment [TCGA Include for Analysis]
Comment [TCGA Data Type]
Comment [TCGA Data Level]
Comment [TCGA Include for Analysis]
Comment [TCGA Data Type]
Comment [TCGA Data Level]
Comment [TCGA Archive Name]

All columns except Comment [TCGA Barcode] currently must exist. Following the transition to UUID-based identifiers, all columns must exist.

N/A

Applicable

Applicable

N/A

Other than specified columns (except comment columns) new columns are not allowed

N/A

Applicable

Applicable

N/A

Columns should not contain blank values

N/A

Applicable

Applicable

N/A

Material Type

Column value should be one of the following elements:

  1. Total RNA
  2. ->

N/A

Applicable

Applicable

N/A

Protocol REF

Column value should be one of the following elements:

  1. Element matching domain:protocol:platform:version pattern
  2. ->

N/A

Applicable

Applicable

N/A

Comment [NCBI SRA Experiment Accession]

Column value should be one of the following elements:

  1. Element matching 'SRX[0-9]{6}' pattern
  2. ->

N/A

Applicable

Applicable

N/A

Comment [NCBI dbGAP Experiment Accession]

Column value should be one of the following elements:

  1. Element matching 'ph.[0-9]{6}\.v[0-9]\.p[0-9]' pattern
  2. ->

N/A

Applicable

Applicable

N/A

Annotation REF

Column value should be one of the following elements:

  1. Valid URL
  2. ->

N/A

Applicable

Applicable

N/A

Assay Name

Column value should match the corresponding 'Extract Name’ column value

N/A

Applicable

Applicable

N/A

Multiple columns allowed

N/A

Applicable

Applicable

N/A

Term Source REF

Column value should match the 'Term Source Names’ in the IDF file

Applicable

Applicable

Applicable

Applicable

Extract Name

Column value should be either a UUID or a barcode

Applicable

Applicable

Applicable

N/A

Blank value is not allowed

Applicable

Applicable

Applicable

N/A

Value should not contain leading or trailing whitespace

Applicable

Applicable

Applicable

N/A

If the value is UUID, it should match the following pattern

Applicable

Applicable

Applicable

N/A

If the value is a barcode, it should match the following pattern

Applicable

Applicable

Applicable

N/A

Barcode/UUID should already have been submitted by the BCR

Applicable

Applicable

Applicable

N/A

If the value is a barcode, it should belong to the disease set for the tumor type

Applicable

Applicable

Applicable

N/A

Column value should not contain leading or trailing whitespace

Applicable

Applicable

Applicable

N/A

For each column value, there should be corresponding 'mirna.quantification.txt' and 'isoform.quantification.txt' filenames in the 'Derived Data File' columns

N/A

Applicable

N/A

N/A

For each column value, there could be a corresponding '.wig’ filename in 'Derived Data File’ columns. If it doesn’t exist, do not fail validation but warn the user.

N/A

Applicable

Applicable

N/A

For each column value, there should be corresponding 'gene.quantification.txt', 'exon.quantification.txt', and 'spljxn.quantification.txt' filenames in the 'Derived Data File' columns.

N/A

N/A

Applicable

N/A

Source Name
Material Type
Term Source REF
Provider
Sample Name
Protocol REF
Extract Name
Array Name
Comment [TCGA Data Type]
Comment [TCGA Data Level]
Comment [TCGA File Type]
Comment [TCGA Antibody Name]
Comment [TCGA Include for Analysis]
Comment [TCGA Archive Name]
Comment [TCGA MD5]
Hybridization Name
Scan Name
Array Data File
Data Transformation Name
Derived Array Data File
Derived Array Data Matrix File
Normalization Name
Comment [TCGA Biospecimen Type]
Array Design File
Image File
Annotations File

Other than specified columns (except comment columns) new columns are not allowed

N/A

N/A

N/A

Applicable

The following columns must exist

  • Sample Name
  • Comment [TCGA Biospecimen Type]
  • Array Design File
  • Image File
  • Annotations File

N/A

N/A

N/A

Applicable

Each row should contain same number of elements as header

N/A

N/A

N/A

Applicable

Columns should not contain blank values

N/A

N/A

N/A

Applicable

Comment [TCGA Include for Analysis]

Column value should be one of the following elements:

  1. Yes
  2. No
  3. ->

N/A

N/A

N/A

Applicable

Comment [TCGA Data Type]

Column value should be one of the following elements:

  1. Annotations-Platform Design
  2. Expression-Protein
  3. Annotations-Antibodies
  4. ->

N/A

N/A

N/A

Applicable

Comment [TCGA File Type]

Column value should be one of the following elements:

  1. Antibody Annotations (txt)
  2. Array Slide Image (TIFF)
  3. RPPA Slide Image Measurements (txt)
  4. SuperCurve Results (txt)
  5. MDA_RPPA Slide Design (txt)
  6. Normalized Protein Expression (MAGE-TAB data matrix)

N/A

N/A

N/A

Applicable

Array Design File
Array Data File
Derived Array Data File
Derived Array Data Matrix File
Image File
Annotations File

File columns should be followed by the following comment columns:

  • Comment [TCGA Data Level]
  • Comment [TCGA Data Type]
  • Comment [TCGA File Type]

N/A

N/A

N/A

Applicable

For each valid file column value (other than '->') the corresponding 'Comment [TCGA Data Level]’ column value should be a valid number

N/A

N/A

N/A

Applicable

Sample Name

Column must exist

N/A

N/A

N/A

Applicable

Multiple columns not allowed

N/A

N/A

N/A

Applicable

Column value should be one of the following elements:

  1. Valid UUID
  2. ->

N/A

N/A

N/A

Applicable

If the column value is UUID, it should match the following pattern


N/A

N/A

N/A

Applicable

UUID should already have been submitted by the BCR

N/A

N/A

N/A

Applicable

Comment [TCGA Biospecimen Type]

Column must exist

N/A

N/A

N/A

Applicable

Multiple columns not allowed

N/A

N/A

N/A

Applicable

Column value should be one of the following elements:

  1. Shipped Portion
  2. ->

N/A

N/A

N/A

Applicable

If the column value is 'Shipped Portion’, the corresponding element in 'Sample Name’ column should be a valid UUID

N/A

N/A

N/A

Applicable

Image File
Array Data File
Derived Array Data File
Derived Array Data Matrix File

File columns should be followed by the following comment columns

  • Comment [TCGA Include for Analysis]
  • Comment [TCGA Archive Name]

N/A

N/A

N/A

Applicable

Source Name
Material Type
Sample Name
Material Type
Extract Name
Material Type
Labeled Extract Name
Material Type
Hybridization Name
Scan Name
Normalization Name
Array Data File
Array Data
Derived Array Data File
Derived Array Data
Array Data Matrix File
Derived Array Data Matrix File
Image File
Array Design File
Array Design REF
Protocol REF
Provider
Label
Performer
Date
Description

 

 

 

 

 

Term Source REF
Label Term Source REF
Parameter Value
Unit
Factor Value
Characteristics
Comment
Data Format
Annotations File
Array Name
Data Transformation Name

Other than specified columns (except comment columns) new columns are not allowed

Applicable

N/A

N/A

N/A

Each row should contain same number of elements as header

Applicable

N/A

N/A

N/A

Columns should not contain blank values.

Applicable

N/A

N/A

N/A

Column must have at least one valid element other than '->’

Applicable

N/A

N/A

N/A

'Extract Name' column must exist

Applicable

N/A

N/A

N/A

Array Design REF

Multiple columns allowed

Applicable

N/A

N/A

N/A

Values should match

pattern

Applicable

N/A

N/A

N/A

Comment [TCGA Barcode]

For cases where Extract Name represents the UUID of a TCGA Analyte, the value in Comment [TCGA Barcode] must be the TCGA Barcode for that UUID. (Note: by September 28, all submitted archives will be required to follow the UUID Validation specifications. Any Extract that represents a TCGA Analyte will then need to have the UUID as its Extract Name, followed by a "Comment [TCGA Barcode]" column containing the corresponding TCGA Barcode.)

Applicable

Applicable

Applicable

N/A
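
A few of the value-level rules in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the DCC validator: the function name check_sdrf is hypothetical, the sketch applies every rule to every file regardless of the platform columns (Default, miRNASeq, RNASeq, MDA_RPPA_Core), it accepts 'yes'/'no' in any letter case, and it treats "a valid number" for the data level as a string of digits. The SRX pattern is the one quoted in the table.

```python
import csv
import io
import re

SRX_RE = re.compile(r"SRX[0-9]{6}")  # pattern quoted in the table above
PLACEHOLDER = "->"                   # stands in for "no value" in an SDRF cell


def check_sdrf(text):
    """Return a list of (row_number, message) problems found in SDRF text.

    Row numbers are 1-based and count the header line as row 1.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t",
                           quoting=csv.QUOTE_NONE))
    if not rows:
        return [(0, "file is empty")]
    header = [name.strip() for name in rows[0]]
    problems = []
    for row_number, row in enumerate(rows[1:], start=2):
        # Each row should contain the same number of elements as the header.
        if len(row) != len(header):
            problems.append((row_number, "row has %d elements, header has %d"
                             % (len(row), len(header))))
            continue
        for name, value in zip(header, row):
            if value == "":
                problems.append((row_number, "blank value in " + name))
            elif value == PLACEHOLDER:
                continue
            elif (name == "Comment [TCGA Include for Analysis]"
                  and value.lower() not in ("yes", "no")):
                problems.append((row_number, "expected yes/no in " + name))
            elif (name == "Comment [TCGA Data Level]"
                  and not re.fullmatch(r"[0-9]+", value)):
                problems.append((row_number, "expected a number in " + name))
            elif (name == "Comment [NCBI SRA Experiment Accession]"
                  and not SRX_RE.fullmatch(value)):
                problems.append((row_number, "bad SRA accession in " + name))
    return problems


if __name__ == "__main__":
    sample = (
        "Extract Name\tComment [NCBI SRA Experiment Accession]\t"
        "Comment [TCGA Data Level]\tComment [TCGA Include for Analysis]\n"
        "a\tSRX010729\t3\tyes\n"
        "b\tSRX12\t->\tmaybe\n"
    )
    for problem in check_sdrf(sample):
        print(problem)
```

Checks that need external state, such as whether a barcode or UUID has been submitted by the BCR or whether a listed file exists in an archive, are left out of the sketch.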

 
