NIH | National Cancer Institute | NCI Wiki  



Introduction to CTIIP

Most cancer diagnoses are made based on images. You have to see a tumor, or compare images of it over time, to determine its level of threat. Ultrasounds, MRIs, and X-rays are all common types of images that radiologists use to collect information about a patient and perhaps cause a doctor to recommend a biopsy. Once that section of the tumor is under the microscope, pathologists learn more about it. To gather even more information, a doctor may order a genetic panel. If that panel shows that the patient has a genetic anomaly, the doctor may search for clinical trials that match it, or turn to therapies that researchers have already proven effective for this combination of tumor and genetic anomaly through recent advances in precision medicine.

Yet another way we learn about cancer in humans is through small animal research. Images from small animals allow detailed study of biological processes, disease progression, and response to therapy, with the potential to provide a natural bridge to human disease. Due to differences in how data is collected and stored about animals and humans, however, the bridge is man-made.

Each of these diagnostic images is at a different scale and comes from a different scientific discipline. A large-scale image like an X-ray may be almost life-size. Slices of tumors are smaller still; like genes and proteins, they must be placed on a slide under a microscope to be seen. Not surprisingly, each of these image types requires specialized knowledge to create, handle, and interpret. While their work is complementary, each specialist comes from a different scientific discipline.

If you were the patient, wouldn't you want your medical team to benefit from data collected about your cancer, no matter which discipline it belonged to?

The good news is that it is now possible to both create large databases of information about images and apply existing data standards. The bad news is that each of these databases is protected by proprietary formats that do not communicate with one another, and standards do not yet exist for all image types. Researchers from each of the disciplines under the umbrella called imaging refer to the images in a unique way, using different vocabulary. Wouldn't it be nice if a scientist could simply ask questions without regard to disciplinary boundaries and harness all of the available data about tissue, cells, genes, proteins, and other parts of the body to prove or disprove a hypothesis?

One promise of big data, such as that represented by the large but mutually-exclusive imaging data sets mentioned so far, is that mashups can be made that integrate two or more data sets in a single graphical interface so that doctors, pathologists, radiologists, and laboratory technicians can make connections that improve outcomes for patients. Such mashups require and await technical solutions in the areas of data standards and software development. A significant start to all of these technical solutions are the sub-projects of the National Cancer Institute Clinical and Translational Imaging Informatics Project (NCI CTIIP).

CTIIP Sub-Projects

As discussed so far, cancer research is needed across domains. To meet this need, the National Cancer Institute Clinical and Translational Imaging Informatics Project (NCI CTIIP) team plans to create a data mashup interface, along with other software and standards, that accesses The Cancer Genome Atlas (TCGA) clinical and molecular data, The Cancer Imaging Archive (TCIA) in vivo imaging data, caMicroscope pathology data, a pilot data set of animal model data, and relevant imaging annotation and markup data.

The common informatics infrastructure that will result from this project will provide researchers with analysis tools they can use to directly mine data from multiple high-volume information repositories, creating a foundation for research and decision support systems to better diagnose and treat patients with cancer.

CTIIP is composed of the following sub-projects. Each project is discussed on this page.

Sub-Project Name: Digital Pathology and Integrated Query System
Description: Addresses the interoperability of digital pathology data, improves integration and analytic capabilities between TCIA and TCGA, and raises the level of interoperability to create the foundation required for pilot demonstration projects in each of the targeted research domains: clinical imaging, pre-clinical imaging, and digital pathology imaging.

Sub-Project Name: DICOM Standards for Small Animal Imaging; Use of Informatics for Co-clinical Trials
Description: Addresses the need for standards in pre-clinical imaging and tests the informatics created in the Digital Pathology and Integrated Query System sub-project for decision support in co-clinical trials.

Sub-Project Name: Pilot Challenges
Description: Challenges will be designed to develop knowledge-extraction tools and compare decision-support systems for the three research domains, which will now be represented as a set of integrated data from TCIA and TCGA. The pilot challenges would use limited data sets for proof of concept and test the informatics infrastructure needed for more rigorous "Grand Challenges" that could later be scaled up and supported by extramural initiatives.

The Importance of Data Standards

The common infrastructure that will result from CTIIP and its sub-projects depends on data interoperability, which is greatly aided by adherence to data standards. While data standards exist to support communicating image data in a common way, they are inconsistently adopted. One reason for the lack of uniform adoption is that vendors of the image management tools required for the analysis of imaging data have built those tools to accept only proprietary data formats. Researchers then make sure their data can be interpreted by these tools. The result is that images produced on different systems cannot be analyzed via the same mechanisms.

Another challenge for CTIIP, with its goal of integrating data from complementary domains, is the lack of a defined standard for co-clinical and digital pathology data. Without a data standard for these domains, it is very difficult to share and leverage such data across studies and institutions. As part of the CTIIP project, the team will extend the DICOM model to co-clinical and small animal imaging.

NCI CBIIT has worked extensively for several years in the area of data standards for both clinical research and healthcare, working with the community and with Standards Development Organizations (SDOs) such as the Clinical Data Interchange Standards Consortium (CDISC), Health Level Seven (HL7), and the International Organization for Standardization (ISO). From that work, Enterprise Vocabulary Services (EVS) and the Cancer Data Standards Registry and Repository (caDSR) are harmonized with the Biomedical Research Integrated Domain Group (BRIDG), Study Data Tabulation Model (SDTM), and HL7 Reference Information Model (RIM) models. Standardized Case Report Forms (CRFs), including those for imaging, have also been created. This CBIIT work provides the bioinformatics foundation for semantic interoperability in digital pathology and co-clinical trials, integrated with clinical and patient demographic data and with data contained in TCIA and TCGA.

Within the three research domains that CTIIP intends to make available for integrative queries, only one, clinical imaging, has made some progress in terms of establishing a framework and standards for informatics solutions. Those standards include Annotation and Image Markup (AIM), which allows researchers to standardize annotations and markup for radiology and pathology images, and Digital Imaging and Communications in Medicine (DICOM), a standard for handling, storing, printing, and transmitting information in medical imaging. For pre-clinical imaging and digital pathology, there are no such standards that allow for the seamless viewing, integration, and analysis of disparate data sets to produce integrated views of the data, quantitative analysis, and research or clinical decision support systems.

As part of the DICOM Standards for Small Animal Imaging; Use of Informatics for Co-clinical Trials sub-project, the long-term goal is to generate DICOM-compliant images for small animal research. MicroAIM (µAIM) is currently in development to serve the unique needs of this domain.

The following table presents the data that the CTIIP team is integrating through various means. This integration relies on the expansion of software features and on the application of data standards, as described in subsequent sections of this document.

Domain | Data Set | Applicable Standard
Clinical Imaging | The Cancer Genome Atlas (TCGA) clinical and molecular data | DICOM
Clinical Imaging | The Cancer Imaging Archive (TCIA) in vivo imaging data | DICOM
Pre-Clinical | Small animal models | N/A (MicroAIM in development)
Digital Pathology | caMicroscope | DICOM
All | Annotations and markup on images | AIM

Digital Pathology and Integrated Query System

The goal of this foundational sub-project is to create a digital pathology image server that can accept images from multiple domains and run integrative queries on that data. Using this server, which is an extended version of caMicroscope, researchers can select data from different imaging data sets and use them in image algorithms. The first data sets that are being integrated on this image server are TCGA and TCIA.

The TCGA project is producing a comprehensive genomic characterization and analysis of 200 types of cancer and providing this information to the research community. TCIA and the underlying National Biomedical Image Archive (NBIA) manage well-curated, publicly-available collections of medical image data. The linkages between TCGA and TCIA are valuable to researchers who want to study diagnostic images associated with the tissue samples sequenced by TCGA. TCIA currently supports over 40 active research groups including researchers who are exploiting these linkages.

Although TCGA and TCIA comprise a rich, complementary, multi-discipline data set, they are in an infrastructure that provides limited ability to query the data. Researchers want to query both databases at the same time to identify cases based on all available data types. While TCGA and TCIA are DICOM-compliant, digital pathology and co-clinical/small animal model environments do not share the same data standards or do not use them consistently.

To address these limitations, the CTIIP team is developing a unified query interface to make it easier to analyze data from the different research disciplines represented by TCGA, TCIA, and co-clinical/small animal model data. The lack of common data standards will not be a hindrance to data analysis, since the server hosting the unified query interface will accept whole slides without recoding. The unified query interface will also provide a common platform and data engine for hosting "pilot challenges," which are described in more detail below. Pilot challenges will advance biological and clinical research in a way that also integrates the clinical, co-clinical/small animal model, and digital pathology imaging disciplines.

Digital Pathology

Digital pathology, unlike its more mature radiographic counterpart, has yet to standardize on a single storage and transport media. In addition, each pathology-imaging vendor produces its own image management systems, making image analysis systems proprietary and not standardized. The result is that images produced on different systems cannot be analyzed via the same mechanisms. Not only does this lack of standards and the dominance of proprietary formats impact digital pathology, but it prevents digital pathology data from integrating with data from other disciplines.

The purpose of the digital pathology component of CTIIP is to support data mashups between image-derived information from TCIA and clinical and molecular metadata from TCGA. The team is using OpenSlide, a vendor-neutral C library, to extend the software of caMicroscope, a digital pathology server, to provide the infrastructure for these data mashups. The extended software will support some of the common formats adopted by whole slide vendors as well as basic image analysis algorithms. With the incorporation of common whole slide formats, caMicroscope will be able to read whole slides without recoding, which often introduces additional compression artifacts. These additional features of caMicroscope will make it possible to integrate digital pathology images within TCIA and NBIA and provide a logical bridge from proprietary pathology formats to DICOM standards.

Data federation, a process whereby data is collected from different databases without ever copying or transferring the original data, is also part of the new infrastructure. It will make it possible to create integrative queries using data from TCIA and TCGA. The software used to accomplish this data federation is Bindaas, the middleware that is also used to build the backend infrastructure of caMicroscope; the team is extending Bindaas with a data federation capability that makes it possible to query data from both repositories.
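The essence of such a federation step can be sketched as a join on a shared patient identifier, performed in memory rather than by copying either source database. This is an illustrative sketch only; the field names and records below are invented, not the actual TCIA or TCGA schemas.

```python
# Hypothetical sketch of a data-federation mashup: image metadata from one
# repository is joined with clinical records from another on a shared
# patient identifier. All field names and values are illustrative.

def federate_by_patient(tcia_records, tcga_records):
    """Join image metadata with clinical records on patient ID."""
    clinical_by_id = {rec["patient_id"]: rec for rec in tcga_records}
    mashup = []
    for image in tcia_records:
        clinical = clinical_by_id.get(image["patient_id"])
        if clinical is not None:
            # Merge the two views of the same patient into one record.
            mashup.append({**image, **clinical})
    return mashup

tcia = [{"patient_id": "TCGA-02-0001", "modality": "MR", "series": 4}]
tcga = [{"patient_id": "TCGA-02-0001", "diagnosis": "glioblastoma"}]
print(federate_by_patient(tcia, tcga))
```

In a real federation layer such as Bindaas, each repository would be queried through its own interface and only the matched results materialized, but the join logic is conceptually the same.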

Image annotations also require standards so that they can be read across imaging disciplines along with the rest of the image data. caMicroscope will also be extended to include standards-based image annotation using the Annotation and Image Markup (AIM) standard.

Integrated Query System

To make data comparable, it must first be collected in a structured fashion. For example, TCGA relies on Common Data Elements, which are the standard elements used to validate TCGA clinical data. Second, data comparisons require common data standards. For example, when a tumor is described in a human or an animal, a data standard would require that the type of tumor match one of a discrete number of options using approved vocabulary, such as "brain".
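The controlled-vocabulary requirement described above can be sketched as a simple validation check. The permitted terms below are invented for illustration; an actual Common Data Element would reference an approved terminology such as those in EVS.

```python
# Illustrative sketch of validating a data element against a controlled
# vocabulary, in the spirit of TCGA Common Data Elements. The vocabulary
# here is a made-up example, not an official term list.

TUMOR_SITE_VOCABULARY = {"brain", "breast", "lung", "prostate"}

def validate_tumor_site(value):
    """Accept a tumor site only if it matches an approved vocabulary term."""
    normalized = value.strip().lower()
    if normalized not in TUMOR_SITE_VOCABULARY:
        raise ValueError(f"'{value}' is not an approved tumor site term")
    return normalized

print(validate_tumor_site("Brain"))
```

Validating at collection time in this way is what makes later cross-species and cross-study comparisons possible: every record uses the same discrete set of terms.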

The integrated query system currently in development will serve as an archive of images from multiple imaging disciplines.

To address the technical challenges inherent in such a system, several solutions are being developed. One of the most fundamental to the success of the integrated query system is an Application Program Interface (API) that provides Representational State Transfer (REST) access to TCIA metadata and image collections. This API is built using Bindaas, the middleware platform that is also used to build the backend infrastructure of caMicroscope. The API is being designed to support federation of multiple information repositories using the concept of data mashups. Because this infrastructure is open source and extensible, it can be expanded to include more data types and additional integration, as well as provide analytic and decision support, forming a foundation for a broader set of novel community research projects.
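A client of such a REST metadata service might compose queries as shown below. The base URL, endpoint path, and parameter names are assumptions for illustration only, not the documented TCIA API.

```python
# Minimal sketch of composing a query against a REST metadata service of
# the kind described above. The host, endpoint, and parameters are
# hypothetical placeholders.

from urllib.parse import urlencode

BASE_URL = "https://services.example.org/tcia/query"

def build_series_query(collection, modality, fmt="json"):
    """Compose a REST query URL for image series metadata."""
    params = urlencode({"Collection": collection,
                        "Modality": modality,
                        "format": fmt})
    return f"{BASE_URL}/getSeries?{params}"

url = build_series_query("TCGA-GBM", "MR")
print(url)
```

A mashup layer would issue several such requests against different repositories and merge the JSON responses on a shared key, as in the federation sketch earlier on this page.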

Small Animal/Co-clinical Improved DICOM Compliance and Data Integration

While the challenges of integrating small animal/co-clinical data with data on humans are steep, given the lack of common data standards, the potential rewards are great. These rewards depend on a common data standard for human and small animal data and support by equipment manufacturers for the standard.

The goal of the Small Animal/Co-clinical Improved DICOM Compliance and Data Integration sub-project is to directly compare data from co-clinical animal models to real-time clinical data from TCGA. The team will accomplish this by applying common data elements used in TCGA with animal applicability, such as estrogen-receptor (ER) negative and positive, to a co-clinical data set. Specifically, this sub-project will:

  • Develop a supplement to the DICOM standard to accommodate small animal imaging.
  • Identify a pilot co-clinical data set to integrate with TCIA and TCGA.

For example, consider the following research question, made possible through increased DICOM compliance by small animal/co-clinical data.

  • If you treat a mouse with an estrogen-receptor (ER) negative tumor with a certain drug, how does the outcome compare to that of a human with the same tumor and ER status?

With small animal/co-clinical data meeting the DICOM standard, researchers could find a mouse with the same kind of tumor and compare its response to various therapies that could help generate sophisticated diagnoses and treatment plans.
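The comparison the standard would enable can be sketched as grouping outcome records by species and ER status, so a mouse cohort can be set against the matching human cohort. All records and field names below are invented for illustration.

```python
# Hedged sketch of a cross-species cohort comparison made possible by a
# shared data standard: records are grouped by (species, ER status) so
# that mouse and human outcomes for the same tumor profile can be
# compared side by side. The data is fabricated for the example.

from collections import defaultdict

def cohorts_by_er_status(records):
    """Group treatment-outcome records by (species, ER status)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["species"], rec["er_status"])].append(rec["response"])
    return dict(groups)

records = [
    {"species": "mouse", "er_status": "negative", "response": "partial"},
    {"species": "human", "er_status": "negative", "response": "partial"},
    {"species": "human", "er_status": "positive", "response": "complete"},
]
print(cohorts_by_er_status(records)[("mouse", "negative")])
```

Without a common standard, the `species` and `er_status` fields of real data sets would be encoded differently in each repository, and this grouping could not be done directly.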

Pilot Challenges

The Pilot Challenges sub-project is unique within CTIIP because rather than focusing on data standards and integration, it demonstrates that integration in practice.

Challenges are increasingly viewed as a mechanism to foster advances in a number of domains, including healthcare and medicine. The US Federal government, as part of its open government initiative, has underscored the role of challenges as a way to "promote innovation through collaboration and (to) harness the ingenuity of the American Public." Large quantities of publicly available data and cultural changes in the openness of science have now made it possible to use these challenges and crowdsourcing efforts to propel the field forward.

Sites such as Kaggle, InnoCentive, and TopCoder are increasingly being used in the computer science and data science communities in a range of creative ways. Commercial entities such as Walmart leverage them to find qualified employees, while rewarding participants with monetary prizes as well as less tangible rewards, such as public acknowledgement of their efforts in advancing the field.

In the biomedical domain, challenges have been used effectively in bioinformatics, as seen in recent crowd-sourced efforts such as the Critical Assessment of Protein Structure Prediction (CASP), the CLARITY Challenge for standardizing clinical genome sequencing analysis and reporting, and The Cancer Genome Atlas Pan-Cancer Analysis Working Group. DREAM (Dialogue for Reverse Engineering Assessments and Methods) Challenges, including the prostate challenge currently underway, are being used for the assessment of predictive models of disease.

Some of the key advantages of challenges over conventional methods include 1) scientific rigor (sequestering the test data), 2) comparing methods on the same datasets with the same, agreed-upon metrics, 3) allowing computer scientists without access to medical data to test their methods on large clinical datasets, 4) making resources available, such as source code, and 5) bringing together diverse communities (that may traditionally not work together) of imaging and computer scientists, machine learning algorithm developers, software developers, clinicians and biologists.

However, despite this potential, there are a number of obstacles. Medical data is usually governed by privacy and security policies, such as HIPAA, that make it difficult to share patient data. Patient health records can be very difficult to completely de-identify. Medical imaging data, especially brain MRI, can be particularly challenging, as one could easily reconstruct a recognizable 3D model of the subject.

Crowdsourcing can blur the lines of intellectual property ownership and can make it difficult to translate the algorithms developed in the context of a challenge into a commercial product. A hypothetical example is the development of an algorithm by a university researcher for a contest held by a commercial entity with the express purpose of implementing it in a product. Although the researcher who won the contest may have been compensated monetarily, because the IP was developed during her time at the university, the IP is now owned by the university, which may not release the rights to the company without further licensing fees.

The infrastructure requirements to both host and participate in some of these "big data" efforts can be monumental. Medical imaging data can be large, historically requiring the shipping of disks to participants. The computing resources needed to process these large datasets may be beyond what is available to individual participants. For the organizers, creating infrastructure that is secure, robust, and scalable can require resources beyond the reach of many researchers. These resources include IT manpower, compute capability, and domain knowledge.

The medical imaging community has conducted a host of challenges at conferences such as MICCAI and SPIE. However, these have typically been modest in scope, both in terms of data size and number of participants. Medical imaging data poses additional challenges to both participants and organizers. For organizers, ensuring that the data are free of PHI is both critical and non-trivial. Medical data is typically acquired in DICOM format, but ensuring that a DICOM file is free of PHI requires domain knowledge and specialized software tools. Multimodal imaging data can be extremely large. Imaging formats for pathology images can be proprietary, and interoperability between formats can require additional software development effort. Encouraging non-imaging researchers (e.g., machine learning scientists) to participate in imaging challenges can be difficult due to the domain knowledge required to convert medical imaging into a set of feature vectors. For participants, access to large compute clusters with sufficient computing power, storage space, and bandwidth can prove difficult.

However, it is imperative that the imaging community develop the tools and infrastructure necessary to host these challenges and potentially enlarge the pool of methods by making it more feasible for non-imaging researchers to participate. Resources such as The Cancer Imaging Archive (TCIA) have greatly reduced the burden of sharing medical imaging data within the cancer community and of making these data available for use in challenges. Although a number of challenge platforms currently exist, we are not aware of any system that meets all the requirements necessary to host a medical imaging challenge.

In this section, we review a few historical imaging challenges. We then list the requirements we believe to be necessary (and nice to have) to support large-scale multimodal imaging challenges. We then review existing systems and develop a matrix of features and tools. Finally, we make some recommendations for developing the Medical Imaging Challenge Infrastructure (MedICI), a system to support medical imaging challenges.

The Pilot Challenges will develop knowledge-extraction tools and compare decision-support systems for clinical imaging, co-clinical imaging, and digital pathology, which will now be represented as a set of integrated data from TCIA and TCGA. The intent is not to implement a rigorous "Grand Challenge," but rather to develop "Pilot Challenge" projects that use limited data sets for proof of concept and test the informatics infrastructure needed for such Grand Challenges, which would be scaled up and supported by extramural initiatives later in 2014 and beyond.

The CTIIP team will leverage and extend the above platform and data systems to validate and share algorithms and to support precision medicine and clinical decision-making tools, including correlation of imaging phenotypes with genomic signatures. The aims are fashioned as four complementary "Pilot Challenges":

  • Clinical Imaging: QIN image data for several modalities and organ systems are already hosted on TCIA. Pilot challenge projects are being explored for X-ray CT, DWI MRI, and PET CT, similar to the HUBzero pilot CT challenge project.

  • Pre-clinical/Co-clinical Imaging: Leverage the Mouse Models of Human Cancer Consortium (MMHCC) glioblastoma co-clinical trials with associated 'omics data sets from the Human Brain Consortium. This proof of concept will focus on bringing together 'omics and imaging data in a single platform.

  • Digital Pathology Clinical Support: Leveraging Aims 1-3, develop open-source image analysis algorithms that complement 'omics data sets and provide additional decision support.

  • Community Sharing: Enable community sharing of algorithms on a software clearinghouse platform such as HUBzero.

Three pilot challenges are planned: pathology, radiology, and co-clinical.

MICCAI (Medical Image Computing and Computer-Assisted Intervention) focuses on image-based interventions in tumors, cardiology, and other areas. Mass General will guide the pilots.

Ground truth: determine the compatibility of the informatics needed to run the pilots. Take images from TCIA and TCGA along with clinical data and compare them.

Jayashree is running a MICCAI challenge in Munich on segmentation of nuclei in pathology images and on combined radiology and pathology classification. The goal is to be able to say that these informatics allow comparison of pathology, radiology, and co-clinical findings. The team will document the approach, technology, and applications needed to run a MICCAI challenge following this model.

In the planned challenges, pathology images will be displayed, with their markup and annotations, in caMicroscope. One challenge will use animal model data. Participants are given images they have never seen before and develop algorithms (for example, to circle all the nuclei). Ground truth is decided by a pathologist and a radiologist, and the algorithm that comes closest to the ground truth is the winner.


Challenge Management System, MedICI

The Medical Imaging Challenge Infrastructure (MedICI) is built on the following components:

  1. CodaLab, an open-source challenge platform
  2. ePAD (created by Daniel Rubin's group at Stanford), a tool for annotating images that produces AIM annotations
  3. caMicroscope

http://miccai.cloudapp.net:8000/competitions/28

  1. Competition #1: The MICCAI challenge has a training phase, in which participants train their algorithms, and a test phase, in which they run their algorithms on images they have never seen before. Results are compared to ground truth that is determined beforehand. caMicroscope is used to view the images and to visualize the results, and the degree of overlap with the ground truth determines the winner.
  2. Competition #2: Participants are given slides.
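An overlap score of the kind used to rank submissions against ground truth can be sketched with a Dice coefficient, a common choice for segmentation challenges; the exact metric used in a given competition may differ. Masks here are flat lists of 0/1 pixel labels for simplicity.

```python
# Sketch of segmentation scoring against ground truth. The Dice
# coefficient rewards overlap between the predicted and reference masks;
# 1.0 is perfect agreement, 0.0 is no overlap.

def dice_coefficient(pred, truth):
    """Dice overlap between a predicted and a ground-truth binary mask."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total

truth = [0, 1, 1, 1, 0]
pred  = [0, 1, 1, 0, 0]
print(dice_coefficient(pred, truth))  # 2*2 / (2+3) = 0.8
```

In a challenge, each submission's masks would be scored this way against the sequestered ground truth, and the submission with the highest mean score would win.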


An organizer sets up a competition by creating a competition bundle. The organizer can go to cancerimagingarchive.net and create shared lists, which are pulled into CodaLab; this is how the test and training data are assembled.

The next step is to create the ground truth. For tumor annotations, the regions of interest are necrosis, edema, and active cancer; radiologists create the ground truth.

Once participants upload their results, they can view them in ePAD.

Scenarios

A doctor needs to determine the proper therapy for a patient. The medical team looks at in vivo imaging (radiology and pathology) and runs a gene panel to look for abnormalities. They consult co-clinical trials, in which a tumor similar to the patient's is modeled in a mouse and experimental therapies are tested on the mice. They then run an integrative query across this big data to develop a sophisticated diagnosis.

Visual pathology integrative queries (Ashish at Emory): imaging consistent with ground truth.

The team still needs to explain how the challenge management system and the integrative query system work together in a scientific scenario. The plan is to write separate descriptions, one for the challenge steps and one for the integrative query system, and then determine how well they integrate: in particular, how to annotate a tumor in MedICI so that the annotation is compatible with the annotations in the components of the integrative query system, and what relationships the informatics can reveal between animal and patient findings. After describing each section separately, the team will see whether the two can be merged to answer the scientific question.

Informatics helps us communicate, and it can help us better treat our patients.

For example, breast cancer has biomarkers (progesterone status, etc.). One question to ask is "if the estrogen status is negative in humans, what does the pathology look like?" Then compare this to mice. Is the model we have a good model for the human condition?
