{scrollbar:icons=false}

h1. Question: What Are Some Proposed Next-Generation Sequencing Features for Future caArray Releases?

*Topic*: caArray Usage


*Release*: future caArray release



*Date entered*: 05/03/2012



h2. Answer

Several research groups have asked about adding support for next-generation sequencing (NGS) data to caArray. We hosted two requirements-gathering meetings with potential NGS data users at Columbia University. The users we interviewed fall into two categories with different needs: core facilities, and investigator labs that are heavy users of NGS data. The following two tables summarize the features they consider important. Please add your comments, and any features you would like to see, to the corresponding forum thread.

h3. Must-Have Features

| _Features_ | _Explanations_ |
| Support FASTQ, BAM, and VCF formats | NGS data are stored in these three formats. A FASTQ file contains both base calls and quality scores for sequence reads and is usually gzip-compressed (.gz). A BAM file contains the alignment of multiple reads against a reference genome, as well as base calls and quality scores for each sequence read; it is usually compressed as well. A VCF file contains variant calls. |
| Support large files (100 GB) | A FASTQ file is typically 100 GB for whole-genome sequencing and 5 GB for exome sequencing (at medium depth of coverage). A BAM file is normally around 100 GB. |
| Need to track capture protocol and sequencing platform | For many applications (e.g., exome sequencing, ChIP-seq), specific target genomic DNA regions (e.g., exons) have to be captured and sequenced. Knowledge of both the capture protocol and the sequencing platform is important. |
| Need to save multiple protocols for RNA-seq | Many protocols can be used to capture and filter RNA, such as ribosomal RNA reduction and poly(A) pull-down. All applicable protocols should be saved along with an experiment/project. |
| Need to capture the number of reads and the depth of coverage | Metadata about the contents of a FASTQ and/or BAM file is informative for researchers. |
| Support both single-end sequencing and paired-end sequencing | For single-end sequencing, one FASTQ file and one BAM file are produced. For paired-end sequencing, two FASTQ files and one BAM file are produced. |
| Support many-to-one mapping between FASTQ files and samples | For whole-genome sequencing, several FASTQ files are usually produced for a single sample. |
| Record the version of the reference genome sequence used in read alignment | The version of the reference genome used for mapping and base calling is extremely important and should be provided for each BAM file. |
| Support for Cufflinks output files | It is desirable to be able to retrieve FPKM data for genes or transcripts. |
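
The read-count and depth-of-coverage metadata requested above could be derived directly from a FASTQ file at upload time. The following is a minimal Python sketch, not part of caArray: the function names are hypothetical, it assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines), and it estimates mean depth simply as total sequenced bases divided by the size of the target region.

```python
import gzip


def fastq_read_stats(handle):
    """Return (number of reads, total bases) for a text FASTQ stream.

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    n_reads = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            n_reads += 1
            n_bases += len(line.strip())
    return n_reads, n_bases


def fastq_file_stats(path):
    """Same as fastq_read_stats, but reads a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return fastq_read_stats(handle)


def mean_depth(total_bases, target_size_bp):
    """Rough mean depth of coverage: sequenced bases / target size in bp."""
    return total_bases / target_size_bp
```

Because the file is streamed line by line, memory use stays flat even for files in the 100 GB range, although a full pass over such a file still takes considerable time. For paired-end data, the totals from both FASTQ files would be added before computing depth.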

h3. Nice-To-Have Features

| _Features_ | _Explanations_ |
| Connect to Galaxy or other data analysis tools | Once data are stored in caArray, this connectivity can make it easy for researchers to conduct subsequent data analysis. |
| Ability to search with an aggregate of attributes | caArray currently allows searching with only one attribute at a time, such as experiment title or sample name. It would be useful to enable searching with multiple attributes. |
| Ability to download through Aspera or something more efficient than FTP for large archives, and to verify MD5 checksums | Uploading and downloading data will be time-consuming given the file sizes. Faster is always better. |
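
As an illustration of the MD5 check mentioned above, the sketch below computes a checksum in fixed-size chunks so that a very large download never has to fit in memory. It uses only Python's standard hashlib module; the function names and the 8 MB chunk size are assumptions made for this example, not caArray behavior.

```python
import hashlib


def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Compare a file's MD5 digest with the checksum published alongside it."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A client would call verify_md5 after a transfer completes and retry the download if it returns False.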


h2. Have a comment?


Please leave your comment in the [caArray End User Forum|https://cabig-kc.nci.nih.gov/Molecular/forums/viewtopic.php?f=6&t=773&start=0&sid=7bcdeb3fa07b3e124d2d561ff1124731].



{scrollbar:icons=false}