Question: What Are Some Proposed Next-Generation Sequencing Features for Future caArray Releases?
Topic: caArray Usage
Release: [future]
Date entered: 05/03/2012
Answer
Must-Have Features
Features |
Explanations |
Support FASTQ, BAM, and VCF Formats |
NGS data are stored in those three formats. A FASTQ file contains both base calls and quality scores for sequence reads. It is usually compressed as gz file. A BAM file contains the alignment of multiple reads against a reference genome, as well as base calls and quality scores for each sequence read. It is usually compressed as well. The VCF file contains variant calls. |
Support large files (100GB) |
A FASTQ file is typically 100GB in size for whole genome sequencing and 5GB in size for exon sequencing (at medium depth coverage). A BAM file is normally around 100 GB. |
Need to track capture protocol and sequencing platform |
For many applications (e.g., exome sequencing, ChIP-seq) specific target genomic DNA regions (e.g., exons) have to be captured and sequenced. Knowledge of both the capture protocol and the sequencing platform is important. |
Need to save multiple protocols for RNA-seq |
There are many protocols that can be used to capture and filter RNA, such as ribosome RNA reduction, polyA pull-down. All applicable protocols should be saved along with an experiment/project. |
Need to capture the number of reads and the depth of coverage |
Metadata about what’s in a FASTQ and/or BAM file is informative for researchers. |
Support both single-end sequencing and paired-end sequencing |
For single-end sequencing, one FASTQ and one BAM file are produced. For paired-end sequencing, two FASTQ and one BAM file are produced. |
Support many-to-one mapping between FASTQ files and samples. |
For whole genome sequencing, usually several FASTQ files are produced for a single sample. |
Record the version of the reference genome sequence used in read alignment |
The version of the reference genome used for mapping and base calling is extremely important and should be provided for each BAM file. |
Support for cufflinks output file |
It is desirable to be able to retrieve FPKM data for genes or transcripts. |
Nice-To-Have Features
Features |
Explanations |
Connect to Galaxy or other data analysis tools |
Once data are stored in caArray, this connectivity can make it easy for researchers to conduct subsequent data analysis. |
Ability to search with an aggregate of attributes |
Right now caArray only allows searching with one attribute such as experiment title, sample name, etc. It will be nice to enable search with multiple attributes. |
Ability to download through Aspera or something more efficient than ftp for large archives and perform MD5 checksum |
Data uploading and downloading will be time consuming given the file size. Faster is always better. |
Have a comment?
Please leave your comment in the caArray End User Forum.