{scrollbar:icons=false}

h1. Question: What Are Some Proposed Next-Generation Sequencing Features for Future caArray Releases?


*Topic*: caArray Usage

*Release*: future caArray release

*Date entered*: 05/03/2012

h2. Answer

Several research groups have asked about adding support for next-generation sequencing (NGS) data to caArray. We hosted two requirements-gathering meetings with potential NGS data users at Columbia University. The users we interviewed fall into two categories with different needs: core facilities, and investigator labs that are heavy users of NGS data. The two tables below summarize the features they consider important. Please add your comments, and any features you would like to see, to the corresponding forum thread.


h3. Must-Have Features

| _Features_ | _Explanations_ |
| Support FASTQ, BAM, and VCF formats | NGS data are stored in these three formats. A FASTQ file contains both base calls and quality scores for sequence reads. It is usually gzip-compressed (.gz). A BAM file contains the alignment of multiple reads against a reference genome, as well as base calls and quality scores for each sequence read. It is usually compressed as well. A VCF file contains variant calls. |
| Support large files (100 GB) | A FASTQ file is typically 100 GB for whole-genome sequencing and 5 GB for exome sequencing (at medium depth of coverage). A BAM file is normally around 100 GB. |
| Track the capture protocol and sequencing platform | For many applications (e.g., exome sequencing, ChIP-seq), specific target genomic DNA regions (e.g., exons) have to be captured and sequenced. Knowledge of both the capture protocol and the sequencing platform is important. |
| Save multiple protocols for RNA-seq | Many protocols can be used to capture and filter RNA, such as ribosomal RNA reduction and polyA pull-down. All applicable protocols should be saved along with an experiment/project. |
| Capture the number of reads and the depth of coverage | Metadata about the contents of a FASTQ and/or BAM file is informative for researchers. |
| Support both single-end and paired-end sequencing | For single-end sequencing, one FASTQ file and one BAM file are produced. For paired-end sequencing, two FASTQ files and one BAM file are produced. |
| Support many-to-one mapping between FASTQ files and samples | For whole-genome sequencing, several FASTQ files are usually produced for a single sample. |
| Record the version of the reference genome sequence used in read alignment | The version of the reference genome used for mapping and base calling is extremely important and should be provided for each BAM file. |
| Support Cufflinks output files | It is desirable to be able to retrieve FPKM data for genes or transcripts. |
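
The read-count and depth-of-coverage metadata requested above could, for example, be derived from a FASTQ file when it is uploaded. The following Python sketch is only an illustration of that idea, not part of caArray. It assumes a well-formed FASTQ file with exactly four lines per record; the names {{open_fastq}}, {{fastq_stats}}, and {{mean_depth}} and the target size used in the demo are hypothetical.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, handling gzip-compressed (.gz) files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_stats(handle):
    """Return (read_count, total_bases) for a four-line-per-record FASTQ stream."""
    reads = 0
    bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the base calls
            reads += 1
            bases += len(line.strip())
    return reads, bases


def mean_depth(total_bases, target_size):
    """Rough mean depth of coverage: sequenced bases / size of the target region."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return total_bases / target_size


# Tiny in-memory example with two reads (4 and 6 bases) and a made-up
# target region of 5 bases.
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
reads, bases = fastq_stats(sample)
print(reads, bases, mean_depth(bases, target_size=5))
```

The depth computed this way is only an estimate, because it counts every sequenced base, including reads that do not align or that are duplicates. A value derived from the BAM file would be more accurate.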

h3. Nice-To-Have Features

| _Features_ | _Explanations_ |
| Connect to Galaxy or other data analysis tools | Once data are stored in caArray, this connectivity can make it easy for researchers to conduct subsequent data analysis. |
| Ability to search with an aggregate of attributes | caArray currently allows searching by only one attribute at a time, such as experiment title or sample name. It would be useful to search on multiple attributes together. |
| Ability to download large archives through Aspera or another protocol more efficient than FTP, and to verify MD5 checksums | Uploading and downloading data will be time-consuming given the file sizes. Faster is always better. |
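
To illustrate the MD5 check mentioned in the last row, the following Python sketch computes the checksum of a downloaded file in fixed-size chunks, so that a 100 GB archive never has to fit in memory. It is an assumption about how a client might verify a download, not a description of caArray behavior. The function names {{md5_of_file}} and {{verify_download}} are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"" (end of file)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest with the published one, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A client would call {{verify_download}} with the local file path and the checksum published alongside the archive, and retry the transfer if it returns {{False}}.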


h2. Have a comment?


Please leave your comment in the [caArray End User Forum|https://cabig-kc.nci.nih.gov/Molecular/forums/viewtopic.php?f=6&t=773&start=0&sid=7bcdeb3fa07b3e124d2d561ff1124731].



{scrollbar:icons=false}