Welcome to the CBIIT Speaker Series Wiki
Our ability to deeply investigate the cancer genome is outpacing our ability to relate these changes to the phenotypes that they produce. Transformational change is possible, but we will need to address several fundamental challenges, including: (1) accurate phenotyping across entire populations of cancer patients, (2) sharing of clinical, imaging, and sequencing data associated with cancer biospecimens, and (3) processing of complex, high-dimensional data in combination with clinical data. In this CBIIT talk, I will share our experiences in two different open-source, NCI-funded projects to develop technology that can help address these fundamental challenges:
The TIES Cancer Research Network (TCRN) is a federated network of Cancer Centers that enables collaborative access to deidentified and NLP-processed data, images, and biospecimens across all institutions. A network “trust” agreement among all TCRN institutions and policies for managing the network make it possible for investigators to easily access this large dataset. TCRN is based on a scalable model that could support a national clinical data and resource sharing network for Precision Medicine.
The Cancer Deep Phenotyping project (DeepPhe) is a new collaboration with the Boston Children’s Hospital cTAKES team that focuses on development of advanced methods for phenotype extraction and representation. Expected outcomes of this project include software pipelines for processing clinical documents to extract summarizations of key cancer phenotype variables over time, including stage, tumor extent, recurrence, and outcome.
The cost of DNA sequencing has dropped more than one-million-fold over the last decade, making it increasingly possible to discover the genetic basis of cancer and response to treatment. Three challenges impede this goal: 1) Analysts lack the resources to download, store and compute on the data; 2) Existing tools and infrastructure have not been designed to scale to handle petabytes or exabytes; and 3) Collaboration is hindered by the current model of storing data locally.
The large-scale sequencing efforts of TCGA have begun to elucidate the genetic pathogenesis of cancer, enabling the development of targeted therapies. However, to enter an era of true “precision medicine,” we need to create sophisticated information technologies to store, analyze, and share data. FireCloud offers a solution to these needs.
FireCloud democratizes data access and facilitates collaboration by providing a robust, scalable platform accessible to the community at large. Using the elastic compute capacity of Google Cloud, FireCloud empowers analysts, tool developers, and production managers to perform large-scale analysis, engage in data curation, and store or publish results. FireCloud is modeled after Firehose, an on-premises analysis infrastructure developed by the Getz Lab at the Broad Institute.
As in Firehose, workspaces are central to the FireCloud architecture. Workspaces are computational sandboxes that enable users to organize genomic data and metadata into a data model, run analysis methods, and view results. Users can upload their own analysis methods to workspaces or import the Broad Institute’s best-practice tools and pipelines. FireCloud will include tutorial workspaces and carefully curated Open and Controlled Access TCGA workspaces that users can clone.
FireCloud will enable the mission of TCGA and other cancer genome projects by provisioning workspaces with curated data and best-practice tools and pipelines. This will empower researchers across the globe to explore the TCGA data in new and innovative ways, increasing opportunities for novel contributions to cancer research.
The Seven Bridges Cancer Genomics Cloud (CGC) pilot is one of three pilot projects funded by the National Cancer Institute. The overarching goal of the project is to explore how co-localizing large genomics datasets, like The Cancer Genome Atlas, with the dynamic compute infrastructure needed to analyze them can make learning from these data faster and ultimately enable precision medicine.
In this seminar we’ll highlight four guiding principles that have driven development of the Seven Bridges CGC:
Making data available isn’t enough to make it usable: We’ve built a dynamic query engine that allows fast search of more than 140 clinical and biospecimen properties, making it faster and easier to find interesting TCGA data. Importantly, data are immediately available for analysis at scale using both pre-defined and custom workflows.
The best science happens in teams: A fine-grained permissions model allows transparent collaboration in a secure and compliant manner.
Reproducibility shouldn’t be hard: Each analysis, including all parameters, files, and software versions, is fully logged and can be perfectly replicated days or months later.
The impact of TCGA is amplified by new data and tools: Researchers can readily bring their own data and their own tools to analyze alongside TCGA data. Native implementation of the Common Workflow Language (CWL) specification enables portability of tools and workflows to and from other CWL-compliant systems.
The seminar will include a demo of the system and interested researchers can visit www.cancergenomicscloud.org to get involved.
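As a rough illustration of the reproducibility principle above, the sketch below logs the tool name, software version, parameters, and input file digests for one analysis run, then compares two runs by fingerprint. It is not the Seven Bridges implementation: the function names, the record layout, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying the exact file contents."""
    return hashlib.sha256(data).hexdigest()


def make_run_record(tool, version, params, inputs):
    """Build a log record for one analysis run (hypothetical layout).

    `inputs` maps a file name to its raw bytes; only digests are stored.
    """
    return {
        "tool": tool,
        "version": version,
        "params": dict(sorted(params.items())),
        "inputs": {name: file_digest(data) for name, data in sorted(inputs.items())},
    }


def record_fingerprint(record) -> str:
    """Hash the canonical JSON form of a record so two runs can be compared."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_analysis(a, b) -> bool:
    """True when two records describe the same tool, version, parameters, and inputs."""
    return record_fingerprint(a) == record_fingerprint(b)
```

Two records built from the same tool, version, parameters, and file contents produce the same fingerprint regardless of parameter order, while a change to any of them (for example a new software version) produces a different one.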
The Institute for Systems Biology (ISB) Cancer Genomics Cloud (ISB-CGC) is one of three pilot projects funded by the NCI with the goal of democratizing access to the TCGA data by substantially lowering the barriers to accessing and computing over this rich dataset. The ISB-CGC is a cloud-based platform that will serve as a large-scale data repository for TCGA data, while also providing the computational infrastructure and interactive exploratory tools necessary to carry out cancer genomics research at unprecedented scales. The ISB-CGC will provide interactive and programmatic access to the TCGA data, leveraging many aspects of Google Cloud Platform including BigQuery and Compute Engine. The ISB-CGC aims to serve the needs of a broad range of cancer researchers ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to analyze hundreds of terabytes of sequence data. The ISB-CGC will allow scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.
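To make the idea of defining and comparing cohorts concrete, here is a minimal in-memory sketch. It does not use the ISB-CGC APIs or BigQuery; the record fields (`disease`, `stage`, `expr`), the function names, and the sample values are all made up for illustration and are not TCGA data.

```python
from statistics import mean

# Made-up illustrative records; field names and values are assumptions.
SAMPLES = [
    {"id": "S1", "disease": "BRCA", "stage": "II", "expr": {"TP53": 4.0}},
    {"id": "S2", "disease": "BRCA", "stage": "II", "expr": {"TP53": 6.0}},
    {"id": "S3", "disease": "LUAD", "stage": "I", "expr": {"TP53": 1.0}},
    {"id": "S4", "disease": "LUAD", "stage": "III", "expr": {"TP53": 2.0}},
]


def define_cohort(samples, **criteria):
    """Select the samples whose fields equal every given criterion."""
    return [s for s in samples if all(s.get(k) == v for k, v in criteria.items())]


def mean_expression(cohort, gene):
    """Average expression of `gene` over a cohort, or None if no sample has it."""
    values = [s["expr"][gene] for s in cohort if gene in s["expr"]]
    return mean(values) if values else None


def compare_cohorts(a, b, gene):
    """Difference in mean expression of `gene` between cohorts a and b."""
    mean_a, mean_b = mean_expression(a, gene), mean_expression(b, gene)
    if mean_a is None or mean_b is None:
        return None
    return mean_a - mean_b


brca = define_cohort(SAMPLES, disease="BRCA")
luad = define_cohort(SAMPLES, disease="LUAD")
difference = compare_cohorts(brca, luad, "TP53")
```

On a real platform the filtering step would typically be pushed down to a query service rather than run over a Python list; the sketch only shows the shape of the workflow (filter, summarize, compare).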