
Date

Attendees



Agenda

  1. What non-cancer data sets are available that we want to link to or bring into ICDC?
  2. Do we bring in the raw data (BAMs/BAIs) or just the processed data (VCFs)?
  3. What analysis do we want to do with this data?

Discussion items

Time | Item | Who | Notes

Non-Cancer Data Sets
Elaine's reference genomes and whole genome sequences are collated together with VCF files. The data has been vetted and used, and is growing in size. A lot of it is breed-specific. Caveat: there isn't always long-term health follow-up; the data is more population based. Dog10K (3000 sequences). Elaine has released theirs without restrictions. New reference genomes: Great Dane (Jeff Kidd), German Shepherd. Waiting for annotation to be completed on Tasha. Ancient dog nuclear data is becoming available (paper in Science, Greger Larson, Julie Meachen). Golden Retriever (German group). Open questions for the new genomes: are they long-read sequences, and are they true de novo assemblies or aligned to Tasha (which may propagate errors)? Genentech has RNA-seq data from normal dog organs (Debbie). eQTL (expression) studies are ongoing (Elaine); most are on testes, but they do involve fairly large numbers of studies. Dawn: BarkBase has RNA-seq data from different tissues (Broad); there is also a batch of normals at CSU and a large Chinese study (included in Elaine's VCF). Asian dog DNA is also available.

Raw Data or Processed Data?
Start off by putting the VCF file from Elaine on ICDC (1400-1500 sequences). Provide links to other opportunities and to reference genomes. Want to discourage people from using the old dbSNP database for dogs. Would like alternative splicing and genotype data. An advantage of the new Tasha long-read assembly is that it comes from the exact dog everyone has been using for 10 years. Elaine can ask Alex in her lab to provide the links.

Analyses to be done?
If someone goes to ICDC to mine the data and wants to compare against non-tumor-bearing (normal) datasets, does a link suffice? Perhaps we can link to the datasets and bring them into SBG. Erika: need to define use cases so we know how best to make the data available. There is a difference between working from VCF files and starting with raw data. SRA does make data available in Google/Amazon, but SRA will not keep cloud data forever unless it has high traffic. Making it available from ICDC gives us control of the data availability. Suggest we come up with 3-5 analysis use cases; members need to propose use cases for how to utilize the data. Once we have links, we can discuss the landscape of non-cancer dog data. ICDC is not currently designed to display VCF data. Dawn: need to harmonize with the pipeline from the Genomics WG of BPSC, which would require resources. Can we compare the harmonized pipeline with Elaine's analysis? Elaine has published the filters in a paper. She would need to talk to Alex about their BAM files. It is too much to re-analyze the BAM files. Variant filter information is very important; Alex can come on the next call to discuss the filters and go through the pipeline analysis. https://www.nature.com/articles/s41467-019-09373-w
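As a starting point for the use-case discussion, here is a minimal sketch of one possible analysis: listing variants from a tumor sample that do not appear in a VCF of normal dogs. It is illustrative only and was not discussed in the meeting. The function names and toy records are hypothetical, and it assumes both files are called against the same reference assembly with identically normalized alleles, which is exactly what the pipeline harmonization question above would need to settle.

```python
from typing import Iterable, List, Tuple

# A variant is identified here by (CHROM, POS, REF, ALT).
VariantKey = Tuple[str, int, str, str]


def read_variant_keys(lines: Iterable[str]) -> List[VariantKey]:
    """Collect one key per ALT allele from tab-separated VCF text lines."""
    keys: List[VariantKey] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            if alt != ".":  # "." means no alternate allele was called
                keys.append((chrom, pos, ref, alt))
    return keys


def not_in_normal_panel(sample_lines: Iterable[str],
                        panel_lines: Iterable[str]) -> List[VariantKey]:
    """Return sample variants that are absent from the normal panel."""
    panel = set(read_variant_keys(panel_lines))
    return [key for key in read_variant_keys(sample_lines) if key not in panel]


# Toy records (hypothetical, not real dog data).
PANEL_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT,G\t50\tPASS\t.",
]
SAMPLE_VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t200\t.\tC\tA\t60\tPASS\t.",
    "chr2\t300\t.\tG\tC\t60\tPASS\t.",
]

for chrom, pos, ref, alt in not_in_normal_panel(SAMPLE_VCF, PANEL_VCF):
    print(f"{chrom}:{pos} {ref}>{alt}")
```

A comparison like this needs only the processed VCFs, not the BAM files, but its results depend entirely on the variant filters applied upstream.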

Action items

  • Members to come up with use cases to clarify what analyses we want to perform.
  • Alex (Elaine's lab) to provide links to data (Elaine to send email to Matt).
  • Set up next meeting and invite Alex to talk about the pipeline (Wed, Oct. 14, noon).