NIH | National Cancer Institute | NCI Wiki

Definition

RNASeq data contains information about both nucleotide sequence and gene expression. For a discussion of the types of data produced by these kinds of platforms see RNASeq Data Format Specification. Also see the entry for RNASeq Version 2.

Data Overview

Data derived from the sequencing of RNA is one of the sources of gene expression data collected by TCGA. Currently, Level 3 data is created using two distinct methods. The original method followed the RPKM (Reads Per Kilobase of exon model per Million mapped reads) method of quantification. The newer version 2 data (RNASeqV2, introduced in May 2012) uses a combination of MapSplice and RSEM to determine expression levels. In the near future, this data will also be used to identify variants such as SNPs or indels.
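
As a minimal sketch of the RPKM definition named above (reads per kilobase of exon model per million mapped reads), the following Python function computes the value from a raw read count. The function and argument names are illustrative assumptions and are not taken from the TCGA pipeline.

```python
def rpkm(read_count, exon_model_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_model_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = exon_model_length_bp / 1_000        # exon model length in kb
    millions = total_mapped_reads / 1_000_000       # library size in millions
    return read_count / (kilobases * millions)


# 500 reads on a 2,000 bp exon model in a library of 10 million mapped reads
print(rpkm(500, 2_000, 10_000_000))  # prints 25.0
```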

Data File Descriptions

Available Platforms

  • Platform Code - used in archive names.
  • Platform Alias - used to group similar platforms and used in some applications to save space when referring to platforms.
  • Platform Name - full name of platform
  • Available - indicates whether data is available at the DCC for a platform
  • HTTP Display - indicates the directory name data will be deposited in the HTTP directory structure

Platform Code | Platform Alias | Platform Name | Available
IlluminaGA_RNASeq | IlluminaGA_RNASeq | Illumina Genome Analyzer RNA Sequencing | Yes
IlluminaGA_RNASeqV2 | IlluminaGA_RNASeqV2 | Illumina Genome Analyzer RNA Sequencing Version 2 analysis | Yes
IlluminaHiSeq_RNASeq | IlluminaHiSeq_RNASeq | Illumina HiSeq 2000 RNA Sequencing | Yes
IlluminaHiSeq_RNASeqV2 | IlluminaHiSeq_RNASeqV2 | Illumina HiSeq 2000 RNA Sequencing Version 2 analysis | Yes
IlluminaHiSeq_TotalRNASeqV2 | IlluminaHiSeq_TotalRNASeqV2 | Illumina HiSeq 2000 Total RNA Sequencing Version 2 analysis | Yes
IlluminaGA_mRNA_DGE | IlluminaGA_mRNA_DGE | Illumina Genome Analyzer mRNA Digital Gene Expression | Yes
IlluminaHiSeq_mRNA_DGE | IlluminaHiSeq_mRNA_DGE | Illumina HiSeq 2000 mRNA Digital Gene Expression | Yes

Available Data Files

  • Platform Code - used in archive names.
  • Data Level - The TCGA data level (1-3)
  • File Type - The file extension and content type
  • Description - The scientific content of the file

Platform Code | Data Level | File Type | Description

IlluminaGA_RNASeq | Level 3 | Tab-delimited ASCII text
  1. exon.quantification.txt: The calculated expression signal of a particular composite exon of a gene.
  2. gene.quantification.txt: The calculated expression signal of a gene.
  3. spljxn.quantification.txt: The calculated expression signal of a particular composite splice junction of a gene.
  4. .wig: Wiggle coverage file

IlluminaGA_RNASeqV2 | Level 3 | Tab-delimited ASCII text
  1. exon.quantification.txt
  2. junction_quantification.txt
  3. rsem.genes.normalized_results
  4. rsem.isoforms.normalized_results
  5. rsem.genes.results
  6. rsem.isoforms.results

IlluminaHiSeq_RNASeq | Level 3 | Tab-delimited ASCII text
  1. exon.quantification.txt: The calculated expression signal of a particular composite exon of a gene.
  2. gene.quantification.txt: The calculated expression signal of a gene.
  3. spljxn.quantification.txt: The calculated expression signal of a particular composite splice junction of a gene.
  4. .wig: Wiggle coverage file

IlluminaHiSeq_RNASeqV2 | Level 3 | Tab-delimited ASCII text
  1. exon.quantification.txt
  2. junction_quantification.txt
  3. rsem.genes.normalized_results
  4. rsem.isoforms.normalized_results
  5. rsem.genes.results
  6. rsem.isoforms.results

IlluminaHiSeq_TotalRNASeqV2 | Level 3 | Tab-delimited ASCII text
  1. exon.quantification.txt
  2. junction_quantification.txt
  3. rsem.genes.normalized_results
  4. rsem.isoforms.normalized_results
  5. rsem.genes.results
  6. rsem.isoforms.results

IlluminaGA_RNASeq | Level 2 | Tab-delimited ASCII text
  1. .vcf: Variant Call Format file

IlluminaHiSeq_RNASeq | Level 2 | Tab-delimited ASCII text
  1. .vcf: Variant Call Format file

IlluminaGA_mRNA_DGE | Levels 1, 2, and 3 | Tab-delimited ASCII text and plain ASCII text
  1. _frequency.txt: Level 2 data, tab-delimited tag frequency
  2. _genes.txt: Level 3 data, tab-delimited description of tag/gene relationship
  3. _sequence.txt: Level 1 data, not tab-delimited
  4. _tags.txt: Level 3 data, tab-delimited description of tags

IlluminaHiSeq_mRNA_DGE | no data expected | no data expected
  The DCC currently has no data for this platform and is unaware of any potential submissions.

Validations

Level 3 data

RNASeq

The table below lists the specific QCLive Java software components (class files) for RNASeq data file validation.

Component Name | Validates Data File Type | Description
RNASeqDataFileValidator | RNASeq – All | Performs common validation across all RNASeq data file types
RNASeqExonFileValidator | exon.quantification.txt, exon_quantification.txt | Performs validation of RNASeq Exon specific data files
RNASeqGeneFileValidator | gene.quantification.txt | Performs validation of RNASeq Gene specific data files
RNASeqJunctionFileValidator | spljxn.quantification.txt, junction_quantification.txt | Performs validation of RNASeq Splice Junction specific data files
RNASeqRSEMGeneNormalizedFileValidator | rsem.genes.normalized_results | Performs validation of RNASeq RSEM Genes normalized results data files
RNASeqRSEMGeneResultsFileValidator | rsem.genes.results | Performs validation of RNASeq RSEM Gene results data files
RNASeqRSEMIsoformFileValidator | rsem.isoforms.results | Performs validation of RNASeq RSEM Isoforms results data files
RNASeqRSEMIsoformNormalizedFileValidator | rsem.isoforms.normalized_results | Performs validation of RSEM Isoforms normalized results data files

The specific RNASeq data file input types that are validated by the RNASeq Java software components are listed below.


Note: The RNASeqDataFileValidator is omitted from the list, but applies to all RNASeq data files.

Data Type | Filename Format | Validators
Exon | <domain>.<TCGA aliquot barcode/UUID>.<center_token>.<index_integer>.trimmed.annotated.exon.quantification.txt | RNASeqExonFileValidator
Gene | <domain>.<TCGA aliquot barcode/UUID>.<center_token>.<index_integer>.trimmed.annotated.gene.quantification.txt | RNASeqGeneFileValidator
Splice Junction | <domain>.<TCGA aliquot barcode/UUID>.<center_token>.<index_integer>.trimmed.annotated.spljxn.quantification.txt | RNASeqJunctionFileValidator

  • <domain> refers to the submitting institute's Internet domain. For example, broad.edu or bcgsc.ca
  • <center_token> refers to an identifier that individual institutions may use for internal purposes. In general these are not TCGA identifiers.
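
The filename formats above can be split into their fields with a regular expression. The sketch below is illustrative only: it assumes the aliquot field is either a barcode beginning with "TCGA-" or a UUID in 8-4-4-4-12 hexadecimal form, and that <center_token> contains no dots. Neither assumption is stated in this specification, and the example file name and its token "tok1" are made up.

```python
import re

# Assumptions (not from the specification): the aliquot field is a TCGA barcode
# or a UUID, and <center_token> contains no dots. <domain> may contain dots.
RNASEQ_FILENAME_RE = re.compile(
    r"(?P<domain>.+?)\."
    r"(?P<aliquot>TCGA-[A-Za-z0-9-]+"
    r"|[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\."
    r"(?P<center_token>[^.]+)\."
    r"(?P<index>\d+)\."
    r"trimmed\.annotated\.(?P<data_type>exon|gene|spljxn)\.quantification\.txt"
)


def parse_rnaseq_filename(filename):
    """Return the filename fields as a dict, or None if the name does not match."""
    match = RNASEQ_FILENAME_RE.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["index"] = int(fields["index"])
    return fields


# Hypothetical file name built from the template; "tok1" is a made-up token.
example = (
    "bcgsc.ca.TCGA-AB-2803-03A-01T-0734-13.tok1.1"
    ".trimmed.annotated.gene.quantification.txt"
)
print(parse_rnaseq_filename(example))
```

Anchoring on the aliquot field is what lets the pattern cope with domains such as bcgsc.ca that themselves contain a dot.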

RNASeq Validation Rules

RNASeqDataFileValidator

Name: barcode (aliquot)
Location/Type: Filename: barcode
Validation Rules: Validates that the barcode part of all RNASeq data file names has a valid aliquot barcode format
Example:
  • TCGA-AB-2803-03A-01T-0734-13.exon.quantification.txt
  • TCGA-AB-2803-03A-01T-0734-13.gene.quantification.txt
  • TCGA-AB-2803-03A-01T-0734-13.spljxn.quantification.txt

Name: raw_counts, median_length_normalized, RPKM
Location/Type: File Content: Column Header
Validation Rules: Validates the following for each:
  • column header exists and matches the expected column header name
  • rows have data corresponding to the column header
Example: Example File (image of example)

Name: raw_counts:value
Location/Type: File Content: Column Value
Validation Rules: Validates that the value represents a non-negative floating point number (e.g. 0.314)
Example: (image of example)

Name: median_length_normalized:value
Location/Type: File Content: Column Value
Validation Rules: Validates that the value represents a non-negative floating point number
Example: (image of example)

Name: RPKM:value
Location/Type: File Content: Column Value
Validation Rules: Validates that the value represents a non-negative floating point number
Example: (image of example)
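
The RNASeqDataFileValidator rules above can be illustrated with a short Python sketch. This is not the QCLive Java implementation. The barcode pattern is an assumption inferred from the single example TCGA-AB-2803-03A-01T-0734-13, and rejecting NaN and infinite values is likewise an assumption. The function names are hypothetical.

```python
import csv
import io
import math
import re

# Assumed aliquot barcode layout, inferred only from the example barcode above.
ALIQUOT_BARCODE_RE = re.compile(
    r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-\d{2}[A-Z]-\d{2}[A-Z]-[A-Z0-9]{4}-\d{2}"
)
EXPECTED_HEADERS = ["raw_counts", "median_length_normalized", "RPKM"]


def is_valid_aliquot_barcode(barcode):
    return ALIQUOT_BARCODE_RE.fullmatch(barcode) is not None


def is_non_negative_float(text):
    """True if text parses as a finite floating point number >= 0."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return False
    return math.isfinite(value) and value >= 0


def validate_table(text, expected_headers=EXPECTED_HEADERS):
    """Check headers and values of tab-delimited text; return a list of errors."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    headers = reader.fieldnames or []
    errors = [f"missing column header: {h}" for h in expected_headers if h not in headers]
    if errors:
        return errors
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name in expected_headers:
            if not is_non_negative_float(row.get(name)):
                errors.append(f"line {line_no}: invalid {name} value {row.get(name)!r}")
    return errors
```

Calling validate_table on the text of a quantification file returns an empty list when every expected column is present and every value is a non-negative number.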


RNASeqExonFileValidator

Name: exon
Location/Type: File Content: Column Header
Validation Rules: Performs the same validation as the RNASeqDataFileValidator for "File Content: Column Header" types with the addition of the "exon" column
Example: Example File

Name: exon:value
Location/Type: File Content: Column Value
Validation Rules: Validates that value is in the format:
{chrom}:{coord}:{strand}
where:
  • {chrom} = chromosome name corresponding to one of the chromosome names listed in the database chromosome reference table
  • {coord} = set of non-negative integers separated by a dash '-' (e.g. "11874-12227")
  • {strand} = single character that is either '+' (plus) or '-' (minus)
Example: (image of example)
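
A minimal Python sketch of the exon value check described above follows. It assumes {coord} is exactly two integers (start and end) joined by a dash, as in the example "11874-12227". It also substitutes a hard-coded, hypothetical set of chromosome names for the database chromosome reference table, whose actual contents are not given here.

```python
import re

# Hypothetical stand-in for the database chromosome reference table.
KNOWN_CHROMOSOMES = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}

# {chrom}:{coord}:{strand}, with {coord} assumed to be "<start>-<end>".
EXON_RE = re.compile(r"(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+):(?P<strand>[+-])")


def is_valid_exon(value, chromosomes=KNOWN_CHROMOSOMES):
    """True if value has the form chrom:start-end:strand with a known chromosome."""
    match = EXON_RE.fullmatch(value)
    return match is not None and match.group("chrom") in chromosomes


print(is_valid_exon("chr1:11874-12227:+"))
```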


RNASeqGeneFileValidator

Name: gene
Location/Type: File Content: Column Header
Validation Rules: Performs the same validation as the RNASeqDataFileValidator for "File Content: Column Header" types with the addition of the "gene" column
Example: Example File

Name: gene:value
Location/Type: File Content: Column Value
Validation Rules: No validation performed
Example: (image of example)

RNASeqJunctionFileValidator

Name: junction, raw_counts
Location/Type: File Content: Column Header
Validation Rules: Validates the following for each:
  • column header exists and matches the expected column header name
  • rows have data corresponding to the column header
Example: Example File

Name: junction:value
Location/Type: File Content: Column Value
Validation Rules: Validates that value is in the format:
{chrom}:{coord}:{strand},{chrom}:{coord}:{strand}
where:
  • {chrom} = chromosome name corresponding to one of the chromosome names listed in the database chromosome reference table
  • {coord} = non-negative integer
  • {strand} = single character that is either '+' (plus) or '-' (minus)
Example: (image of example)

Name: raw_counts:value
Location/Type: File Content: Column Value
Validation Rules: Performs the same validation as the RNASeqDataFileValidator for "raw_counts:value"
Example: (image of example)
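
The junction value format above (two {chrom}:{coord}:{strand} positions separated by a comma) can be parsed along the following lines. This is an illustrative sketch, not the QCLive validator. The chromosome set is a hypothetical stand-in for the database chromosome reference table, and the sketch assumes no whitespace around the comma.

```python
import re

# Hypothetical stand-in for the database chromosome reference table.
KNOWN_CHROMOSOMES = {f"chr{n}" for n in range(1, 23)} | {"chrX", "chrY", "chrM"}

# One junction end: {chrom}:{coord}:{strand}, with {coord} a non-negative integer.
JUNCTION_END_RE = re.compile(r"(?P<chrom>[^:,]+):(?P<coord>\d+):(?P<strand>[+-])")


def parse_junction(value, chromosomes=KNOWN_CHROMOSOMES):
    """Return [(chrom, coord, strand), (chrom, coord, strand)] or None if invalid."""
    ends = value.split(",")
    if len(ends) != 2:
        return None
    parsed = []
    for end in ends:
        match = JUNCTION_END_RE.fullmatch(end)
        if match is None or match.group("chrom") not in chromosomes:
            return None
        parsed.append((match.group("chrom"), int(match.group("coord")), match.group("strand")))
    return parsed


print(parse_junction("chr1:12227:+,chr1:12595:+"))
```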

IlluminaGA_mRNA_DGE Validation Rules

Data from this platform have not been submitted since 2009. While these archives would be validated according to the general validation rules, there are no platform-specific validation rules.

Standard Archive Validations

All RNASeq and IlluminaGA_mRNA_DGE data are processed using a standard set of validations. Data from RNA Sequencing follow the GCC route.

The validation sets run on all RNASeq data are listed below:

Standard MAGE-TAB File Validations

This data group includes MAGE-TAB archives and documents. All MAGE-TAB archive validations are covered under Standard Archive Validations. All MAGE-TAB documents submitted to the DCC are processed using a standard set of validations.

Standard Result File Validations

MAGE-TAB Data Matrix file

MAGE-TAB Data Matrix format validations are covered in Standard MAGE-TAB File Validations.

Variant Calling Format (VCF) file

VCF Validation Rules
VCF File Spec

Wiggle (WIG) format file

Wiggle files contained in RNASeq archives are accepted, but not validated. A validator was written but removed due to the burden wiggle file validation put on the system. There is a specification covering wiggle files.

Level 2 data Validation 

There will be two archives for RNA-seq based VCF submissions: the data archive and the MAGE-TAB (metadata) archive. Naming conventions will be similar to typical archives.

<domain>_<disease_study>.<platform>.Level_2.<index>.<revision>.<series>
<domain>_<disease_study>.<platform>.mage-tab.1.<revision>.<series>

Example:
bcgsc.ca_COAD.IlluminaHiSeq_RNASeq.Level_2.1.0.0
bcgsc.ca_COAD.IlluminaHiSeq_RNASeq.mage-tab.1.0.0
  • <domain> refers to the submitting institute's Internet domain. For example, broad.edu or bcgsc.ca
  • <disease_study> is the disease abbreviation. For example, BRCA or OV.
  • <platform> refers to the experimental platform, such as IlluminaGA.
  • <index>, <revision> and <series> are integers used to control the archive version.
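
As an illustration of the archive naming convention above, the following Python sketch splits an archive name into its parts. It assumes that <domain> contains no underscore and that <disease_study> and <platform> contain no dots. The two example names from this section satisfy those assumptions, but they are not stated rules.

```python
import re

# <domain>_<disease_study>.<platform>.<Level_2 | mage-tab>.<index>.<revision>.<series>
ARCHIVE_NAME_RE = re.compile(
    r"(?P<domain>[^_]+)_(?P<disease_study>[^.]+)\."
    r"(?P<platform>[^.]+)\."
    r"(?P<archive_type>Level_2|mage-tab)\."
    r"(?P<index>\d+)\.(?P<revision>\d+)\.(?P<series>\d+)"
)


def parse_archive_name(name):
    """Return the archive name fields as a dict, or None if the name does not match."""
    match = ARCHIVE_NAME_RE.fullmatch(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("index", "revision", "series"):
        fields[key] = int(fields[key])  # version fields are integers
    return fields


print(parse_archive_name("bcgsc.ca_COAD.IlluminaHiSeq_RNASeq.Level_2.1.0.0"))
```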

Validity for archive structure

  • Archives will be gzipped tars
  • Archives will be flat, below a directory named after the archive
  • Archives will contain a MANIFEST.txt file; each file in the archive (with the optional exception of the MANIFEST.txt file) will be represented in the MANIFEST.txt as the output of md5sum run against the file.
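
The three structural rules above can be checked roughly as follows. This is a hedged sketch using Python's standard tarfile and hashlib modules, not the DCC's own validator. The function names are hypothetical, and the manifest parser assumes the usual md5sum line layout of a digest followed by a file name.

```python
import hashlib
import io
import tarfile


def parse_manifest(text):
    """Map file name -> MD5 digest from md5sum-style lines ("<digest>  <name>")."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            entries[parts[1].lstrip("*").strip()] = parts[0].lower()
    return entries


def check_archive(fileobj, archive_name):
    """Return a list of problems found in a gzipped tar archive (empty if none)."""
    problems, digests, manifest = [], {}, None
    # Mode "r:gz" makes tarfile raise ReadError if the stream is not gzip-compressed.
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar.getmembers():
            name = member.name.rstrip("/")
            if member.isdir():
                if name != archive_name:
                    problems.append(f"unexpected directory: {name}")
                continue
            top, _, rest = name.partition("/")
            if not member.isfile() or top != archive_name or not rest or "/" in rest:
                problems.append(f"not a plain file directly under {archive_name}/: {name}")
                continue
            data = tar.extractfile(member).read()
            if rest == "MANIFEST.txt":
                manifest = parse_manifest(data.decode("utf-8"))
            else:
                digests[rest] = hashlib.md5(data).hexdigest()
    if manifest is None:
        return problems + ["MANIFEST.txt is missing"]
    for name, digest in digests.items():
        if manifest.get(name) != digest:
            problems.append(f"MANIFEST.txt has no matching md5 for {name}")
    return problems


def build_demo_archive(archive_name, files):
    """Build an in-memory .tar.gz for demonstration; files maps name -> bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{archive_name}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


# Demonstration with a made-up one-line VCF file.
name = "bcgsc.ca_COAD.IlluminaHiSeq_RNASeq.Level_2.1.0.0"
vcf = b"##fileformat=VCFv4.1\n"
manifest_bytes = f"{hashlib.md5(vcf).hexdigest()}  sample.vcf\n".encode()
archive = build_demo_archive(name, {"sample.vcf": vcf, "MANIFEST.txt": manifest_bytes})
print(check_archive(archive, name))
```

MANIFEST.txt itself is skipped when comparing digests, which matches the optional exception noted in the rule.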

Level 2 Archives

Level 2 MAGE-TAB Archives

In addition to a data archive, a MAGE-TAB archive will be required for all RNASeq Level 2 submissions. The MAGE-TAB archive will require:

  • A MANIFEST.txt file
  • A DESCRIPTION.txt file
  • An IDF file
  • An SDRF file

The IDF and SDRF Files will be named as follows:

<domain>_<disease_study>.<platform>_RNASeq.<index>.<revision>.<series>.idf.txt
<domain>_<disease_study>.<platform>_RNASeq.<index>.<revision>.<series>.sdrf.txt

Level 2 Data Archives

In addition to the MANIFEST.txt file, a complete Level 2 RNA-Seq data archive will contain

  • a single DESCRIPTION.txt file (optional)
  • a VCF format file for each aliquot (required)

Data type | File name | Platforms | Validation
VCF (variant) | <domain>.<TCGA barcode/UUID>.<center_token>.<index_integer>.vcf | IlluminaHiSeq_RNASeq | Filename must end with ".vcf"
