
Sequence trace files contain the raw data output from automated sequencing instruments.

Background

GSCs submit sequence trace files, with the associated experimental metadata, directly to NCBI Trace.

While TCGA data is available in this format, sequence trace files are no longer a primary source of data for TCGA, because the project has moved to next generation sequencing methods and file formats such as sequence read files.

Understanding Sequence Trace Files

As TCGA has moved to next generation sequencing methods, BAM files are the new sequence currency. The DCC tracks and provides the relationship between BAM files and biospecimen IDs. Trace files are available for TCGA glioblastoma multiforme (GBM) and ovarian serous cystadenocarcinoma (OV) projects.

Trace File Content and Use

Sequence trace files contain the raw data output from automated sequencing instruments. GSCs submit these files, with the associated experimental metadata, directly to NCBI Trace, a repository within NCBI for raw sequencing data. NCBI Trace assigns each sequence trace record a trace ID and provides that ID back to the submitting GSC. For more information, see NCBI Trace.

Trace files themselves are NOT submitted to the DCC. Instead, the GSCs transfer to the DCC the trace ID-to-sample relationship files, which contain only the NCBI trace ID (trace_id) and the aliquot barcode associated with each trace file submission.

Trace File Format

Trace files are binary files that have the file extension .scf (sequence chromatogram format).
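As a quick sanity check before handing a downloaded file to a trace viewer, the first bytes can be inspected. The sketch below assumes the four-byte signature `.scf` at the start of the file, which comes from the published SCF specification rather than from this page. The function name `looks_like_scf` is hypothetical, and the byte strings are synthetic stand-ins, not real trace data.

```python
import io

# Assumption: per the SCF specification, an SCF file starts with the
# ASCII bytes ".scf". This page itself only states the file extension.
SCF_MAGIC = b".scf"


def looks_like_scf(handle):
    """Return True if the binary stream starts with the SCF signature."""
    return handle.read(4) == SCF_MAGIC


# Synthetic byte strings used purely for illustration.
fake_scf = io.BytesIO(SCF_MAGIC + bytes(124))
not_scf = io.BytesIO(bytes(128))

is_scf = looks_like_scf(fake_scf)
is_other = looks_like_scf(not_scf)
```

For a file on disk, open it in binary mode (`open(path, "rb")`) and pass the handle to the same function. A matching signature only suggests the format; it does not validate the rest of the file.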

The following image from the NCBI Trace website displays an excerpt of a DNA sequence trace file opened in a Trace Archive page. You can download trace files, identified by {trace}.tar filenames, from the website.

Screen shot of trace file metadata example (from NCBI)

Understanding Trace ID-to-Sample Relationship Files

While TCGA data is available in this format, trace ID-to-sample relationship files are no longer a primary source of data for TCGA, because the project has moved to next generation sequencing methods. BAM files are the new sequence currency. The DCC tracks and provides the relationship between BAM files and biospecimen IDs. Trace files are available for the TCGA glioblastoma multiforme (GBM) and ovarian serous cystadenocarcinoma (OV) projects.

Trace ID-to-Sample Relationship File Content and Use

Trace ID-to-sample relationship files link NCBI traces to the analyte from which the trace file was derived by pairing trace IDs with their respective aliquot barcodes. The files contain a listing of NCBI trace IDs and TCGA aliquot barcodes. This combination of data (trace IDs and aliquot barcodes) enables researchers to associate sample IDs/aliquot barcodes with assay results. GSCs transfer these trace ID-to-sample relationship files to the DCC for inclusion in the TCGA data repository for public access.

This data also enables the DCC to query NCBI Trace for additional metadata and to relate this metadata to other experimental results by mapping to BCR biospecimen barcodes.

Because trace ID-to-sample relationship files include BCR aliquot barcodes, researchers can track and record DNA information about a specific participant and make connections between their genes, chromosomal coordinates, tumor types, and so forth. To ensure participant privacy, the DCC secures the trace relationship/aliquot barcode data in a separate data repository that is accessible only to registered research organizations, via a secure FTP (SFTP) site.

Trace ID-to-Sample Relationship File Format

Trace ID-to-sample relationship files have the file extension .tr. The data in a trace ID-to-sample relationship file is tab-delimited, with no leading spaces.

The files are modeled using the following ordered data elements as column headers:

  • trace_id (called ti in NCBI Trace)
  • biospecimen_barcode (this is an aliquot barcode)

Example trace ID-to-sample relationship file name: broad.mit.edu_GBM.ABI.1.tr.
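The two-column, tab-delimited layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the file starts with a header row containing exactly the two column names listed; the page does not show a sample file, so that is an assumption. The function name `read_trace_relationships` is hypothetical, and the sample rows are made up, not real TCGA trace IDs or barcodes.

```python
import csv
import io
from collections import defaultdict

# Assumption: the first row holds the two column headers, in this order.
EXPECTED_HEADER = ["trace_id", "biospecimen_barcode"]


def read_trace_relationships(handle):
    """Map each aliquot barcode to the list of trace IDs paired with it."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header row: {header!r}")

    traces_by_barcode = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got: {row!r}")
        trace_id, barcode = row
        traces_by_barcode[barcode].append(trace_id)
    return dict(traces_by_barcode)


# Made-up rows for illustration only.
sample = io.StringIO(
    "trace_id\tbiospecimen_barcode\n"
    "1000001\tALIQUOT-A\n"
    "1000002\tALIQUOT-A\n"
    "1000003\tALIQUOT-B\n"
)
mapping = read_trace_relationships(sample)
```

Grouping by barcode reflects the purpose of the file: one aliquot can have many traces, so a barcode-to-trace-IDs lookup is usually the more convenient direction. If real .tr files turn out to have no header row, the header check would need to be dropped.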

FASTA in Trace Files

A FASTA file is a text-based format used to represent either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. FASTA files are embedded in the trace files submitted to NCBI. NCBI Trace extracts the FASTA files and makes them available for download. NCBI provides a description of the format. FASTA files are outside the scope of this document.
