
Sequence-based data (within the scope of TCGA) are sequencing data produced by Genome Sequencing Centers (GSCs) using high-throughput sequencing platforms.

About Sequence-Based Data

TCGA sequence data are created by data-generating centers using various platforms that target, for example, the whole genome, the exome, and micro-RNA (miRNA). These centers use the sequence data to identify variants in genes or the genome by comparing tumor-sample results to normal-sample results and a reference. Variant types include:

  • germline and somatic mutations
  • single nucleotide variants/polymorphisms 
  • insertions and deletions (collectively known as in-dels)
  • copy number variations
  • translocations 
  • inversions
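The tumor/normal/reference comparison described above can be illustrated with a deliberately simplified sketch. It assumes a single called allele per sample at one locus; the function name and labels are hypothetical, and real variant callers work from read-level evidence rather than a three-way string comparison.

```python
def classify_variant(reference: str, normal: str, tumor: str) -> str:
    """Label one locus from the alleles called in the reference, normal, and tumor.

    Simplifying assumption: each sample contributes exactly one allele call.
    """
    if normal == reference and tumor == reference:
        # Neither sample differs from the reference at this locus.
        return "no variant"
    if normal != reference:
        # The matched normal already differs from the reference,
        # so the change is treated as inherited.
        return "germline"
    # The normal matches the reference and only the tumor differs.
    return "somatic"


labels = [
    classify_variant("A", "A", "A"),
    classify_variant("A", "G", "G"),
    classify_variant("A", "A", "T"),
]
```

The matched normal sample is what allows a tumor-only change to be separated from an inherited one; without it, both would simply appear as differences from the reference.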

In addition to identifying variants, RNASeq and miRNASeq produce quantification data, for example gene or miRNA expression. For more information, see RNASeq.

Data File Submissions

The data-generating centers that use sequencing platforms generate the following sequence-derived data files:

CGHub Deposit Site

File Type | File Suffix | Data Level | Description
Sequence Read Files | various or fastq | | Sequence read data in their native platform formats (e.g. AB SOLiD, Illumina) or FASTQ format.
Binary Alignment/Map (BAM) files | bam | 1 | A Binary Alignment/Map (BAM) file is the compressed binary version of the Sequence Alignment/Map (SAM), a compact and indexable representation of nucleotide sequence alignments.
Sequence Trace files | scf | 1 | Sequence trace files contain the raw data output from automated sequencing instruments.
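As an illustration of the FASTQ read files listed above, the following is a minimal sketch of a reader. It assumes the common four-line record layout (header starting with "@", sequence, separator starting with "+", quality string) with no line wrapping; the function and type names are hypothetical, and the two reads are made-up placeholders.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Placeholder input, not real TCGA data.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n")
records = list(read_fastq(example))
```

BAM files are binary and are normally read with a dedicated library rather than parsed by hand, so no equivalent sketch is given for them here.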


DCC Deposit Site

File Type | File Suffix | Data Level | Description
Wiggle (WIG) format files | wig | 2 | The wiggle (WIG) format describes dense, continuous data such as sequence coverage, GC percent, and probability scores.
Mutation Annotation Format (MAF) files | maf | 2 or 3 | A Mutation Annotation Format (MAF) file is a tab-delimited file containing somatic and/or germline mutation annotations. MAF files containing any germline mutation annotations are kept in the controlled access portion of the Data Portal; MAF files containing only somatic mutations are kept in the open access portion of the Data Portal. MAF files are considered Level 2 files.
Variant Call Format (VCF) files | vcf | 2 or 3 | The Variant Call Format (VCF) is a standardized format for storing and reporting genomic sequence variations.
Trace ID-to-sample relationship files | tr | 1 |
Verbose Coverage File | vcf | 2 | A verbose coverage file (VCF) provides sequence depth at a mutation locus described in a MAF file.
Quantification files | quantification.txt | 3 | A quantification file provides calculated values for a particular data type based on sequence data. The current data types and quantification formats are based on RNA sequencing results.
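The open versus controlled access rule for MAF files described above could be checked along the following lines. This is a sketch under stated assumptions: it assumes the MAF carries a Mutation_Status column whose value is "Germline" for germline calls and that comment lines start with "#"; the function name and the sample rows are hypothetical. How statuses other than somatic or germline should be handled is not stated on this page, so the sketch does not treat them specially.

```python
import csv
import io
from typing import TextIO


def maf_access_tier(handle: TextIO) -> str:
    """Return "controlled" if any row is annotated as germline, else "open"."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        # Assumed column name; a missing value is treated as an empty string.
        status = (row.get("Mutation_Status") or "").strip().lower()
        if status == "germline":
            return "controlled"
    return "open"


# Placeholder rows, not real TCGA data.
somatic_only = io.StringIO(
    "#version placeholder\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tMutation_Status\n"
    "GENE1\tSAMPLE-T1\tSomatic\n"
    "GENE2\tSAMPLE-T1\tSomatic\n"
)
with_germline = io.StringIO(
    "Hugo_Symbol\tTumor_Sample_Barcode\tMutation_Status\n"
    "GENE1\tSAMPLE-T1\tSomatic\n"
    "GENE3\tSAMPLE-T2\tGermline\n"
)
tiers = (maf_access_tier(somatic_only), maf_access_tier(with_germline))
```

A single germline row is enough to place the whole file in the controlled tier, which matches the "any germline mutation annotations" wording in the table.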

Mapping Sequence-Based Data

Currently, there is no equivalent to a Sample and Data Relationship Format (SDRF) file for sequence-based (GSC) data. However, GSC Mutation Annotation Format (MAF) files do provide the relationship between aliquot barcodes and their associated called variants.
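A barcode-to-variant mapping of this kind might be recovered from a MAF file as sketched below. The column names used (Tumor_Sample_Barcode, Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2) are assumptions that should be checked against the MAF specification version in use; the function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, start, reference allele, tumor allele)


def variants_by_aliquot(handle: TextIO) -> Dict[str, List[Variant]]:
    """Group called variants by the tumor aliquot barcode they were called in."""
    data_lines = (line for line in handle if not line.startswith("#"))
    mapping: Dict[str, List[Variant]] = defaultdict(list)
    for row in csv.DictReader(data_lines, delimiter="\t"):
        variant = (
            row["Chromosome"],
            int(row["Start_Position"]),
            row["Reference_Allele"],
            row["Tumor_Seq_Allele2"],
        )
        mapping[row["Tumor_Sample_Barcode"]].append(variant)
    return dict(mapping)


# Placeholder rows, not real TCGA data.
maf_text = io.StringIO(
    "Hugo_Symbol\tChromosome\tStart_Position\tReference_Allele\t"
    "Tumor_Seq_Allele2\tTumor_Sample_Barcode\n"
    "GENE1\t1\t100\tA\tT\tSAMPLE-T1\n"
    "GENE2\t2\t250\tC\tG\tSAMPLE-T1\n"
    "GENE3\t7\t900\tG\tA\tSAMPLE-T2\n"
)
by_aliquot = variants_by_aliquot(maf_text)
```

Because each MAF row names the aliquot it was called in, the file itself can stand in for the sample relationship information that an SDRF would otherwise supply.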
