Date
Attendees
Alex Harris, Dawn Duval, Erika Kim, Debbie Knapp, Elaine Ostrander
Agenda
- Alex – present the pipelines used in the NHGRI paper
- Discuss use cases for analysis and what tools might be needed in Cloud Resources
- What data should we store/share? Raw vs. processed?
- Next steps?
Discussion items
| Time | Item | Who | Notes |
|---|---|---|---|
| | Pipeline | | Raw data is not user-friendly but takes up less database space; processed data is more user-friendly, but BAM files are large and take up a lot of database space. Preprocessing: take raw reads (uBAM or FASTQ) and convert them to properly paired reads suitable for BWA-MEM; map to the CanFam3.1 reference to produce raw mapped reads (BAM); mark duplicates with Picard tools to produce a deduplicated BAM; recalibrate base quality scores with the GATK 4.0+ BaseRecalibrator and ApplyBQSR tools (also against CanFam3.1) to produce analysis-ready reads (BAM). Main pipeline: run HaplotypeCaller (GATK 4+) on the analysis-ready BAMs to look for variation; it determines haplotypes, computes likelihoods, and assigns sample genotypes based on the read data. Use GenomicsDBImport to gather the per-sample gVCFs into a GenomicsDB datastore; this is the most compute-intensive part and can get user-unfriendly. Use GenotypeGVCFs to perform joint genotyping, then GatherVcfs to create the final VCF. VariantRecalibrator creates a recalibration table for the ApplyVQSR tool plus a tranches file (100, 99.9, 99.0, 90.0) showing metrics of the callset for slices of the data; indels are binned into confidence tranches. Up-front decisions about reference datasets are very important, as are filter settings. ICDC could host the dbSNP dataset, which is hard to find elsewhere; Dawn would prefer to move away from this. Elaine's dataset is germline only (no somatic data). |
| | Potential next steps | | Look into Alex's pipeline, compare it to Genomics WG best practices, and see whether there is interest in bringing some of this analysis to SBG; also make the germline dataset available to ICDC and SBG so normal germline data is available to filter against. Elaine is very interested, but Alex is deployed for Dog10K (18 countries, 10 sites), which is building a robust dataset. Erika asked Alex to share his slides with SBG, since they may already have most of these tools and could stitch them together with the right filters and settings. Dawn suggests that the 722-dog dataset would be a great place to start; Alex also has a 1,399-dog dataset he can publicly release and has already done the hard work on it. What would the process be to release this dataset: send the VCF to ICDC, or deposit it in SRA? Elaine says the sequences are already at NCBI (SRA), so SBG/ICDC can point to the SRA sequences, and the 1,399-dog VCF can be hosted in ICDC. The 722 set does not include Dog10K. |
| | Next Steps | | Get the 722 dataset going now and add the 1,399 dataset when available. Link to the 722 set: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?analysis=SRZ189891. Erika and Matt to talk to the ICDC data team about how to load it. |
| | Use Cases | | How to use non-cancer data as reference data? |
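The preprocessing and joint-genotyping steps described in the Pipeline row above can be sketched as a shell script. This is a minimal illustration, not Alex's actual pipeline: it assumes GATK 4.x, BWA, and samtools are on the PATH, and every file name (reference FASTA, FASTQs, known-sites VCF, sample names, intervals) is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch of the GATK4 germline pipeline discussed above.
# Assumptions: gatk (4.x), bwa, and samtools installed; CanFam3.1.fa indexed;
# dbsnp.canfam3.vcf.gz is a placeholder known-sites resource.
set -euo pipefail

REF=CanFam3.1.fa        # reference genome (placeholder path)
SAMPLE=dog01            # placeholder sample name

# --- Preprocessing: paired FASTQ -> analysis-ready BAM ---
bwa mem -t 8 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
    "$REF" "${SAMPLE}_R1.fastq.gz" "${SAMPLE}_R2.fastq.gz" \
  | samtools sort -o "${SAMPLE}.raw.bam" -

# Mark duplicates (Picard tools, bundled with GATK4)
gatk MarkDuplicates -I "${SAMPLE}.raw.bam" \
    -O "${SAMPLE}.dedup.bam" -M "${SAMPLE}.dup_metrics.txt"

# Base quality score recalibration (BaseRecalibrator + ApplyBQSR)
gatk BaseRecalibrator -I "${SAMPLE}.dedup.bam" -R "$REF" \
    --known-sites dbsnp.canfam3.vcf.gz -O "${SAMPLE}.recal.table"
gatk ApplyBQSR -I "${SAMPLE}.dedup.bam" -R "$REF" \
    --bqsr-recal-file "${SAMPLE}.recal.table" -O "${SAMPLE}.analysis_ready.bam"

# --- Main pipeline: per-sample gVCF -> joint genotyping -> VQSR ---
gatk HaplotypeCaller -R "$REF" -I "${SAMPLE}.analysis_ready.bam" \
    -O "${SAMPLE}.g.vcf.gz" -ERC GVCF

# Gather gVCFs into a GenomicsDB datastore (most compute-intensive step;
# run per interval, e.g. per chromosome)
gatk GenomicsDBImport -V dog01.g.vcf.gz -V dog02.g.vcf.gz \
    --genomicsdb-workspace-path cohort_db -L chr1

gatk GenotypeGVCFs -R "$REF" -V gendb://cohort_db -O cohort.chr1.vcf.gz
gatk GatherVcfs -I cohort.chr1.vcf.gz -I cohort.chr2.vcf.gz -O cohort.vcf.gz

# VQSR: build recalibration table and tranches, then filter
gatk VariantRecalibrator -R "$REF" -V cohort.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.canfam3.vcf.gz \
    -an QD -an FS -an MQ -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -O cohort.recal --tranches-file cohort.tranches
gatk ApplyVQSR -R "$REF" -V cohort.vcf.gz --recal-file cohort.recal \
    --tranches-file cohort.tranches -mode SNP -O cohort.filtered.vcf.gz
```

The tranche values (100, 99.9, 99.0, 90.0) mirror those mentioned in the discussion; the VQSR resource line and annotation set (`-an QD -an FS -an MQ`) are illustrative and would need to match the group's chosen reference datasets and filter settings.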