The NCI Center for Biomedical Informatics and Information Technology (CBIIT) Speaker Series is a bi-weekly knowledge-sharing forum featuring speakers on topics of interest to the biomedical informatics and research communities. General topics to be discussed include but are not limited to novel experimental approaches in basic research that require innovative informatics solutions; general informatics methodologies for specific tasks such as natural language processing and data exchange/integration; novel software applications (proprietary or open source); standards; ontologies; open-source development projects; human/computer interactions; future trends in biomedical informatics research and development; and CBIIT/NCIP partnerships inside and outside NCI/NIH.

Helen Berman SYNOPSIS:

As the crystal structures of biological macromolecules were being determined, a new field of structural biology was born. Inspired by these new structures, the scientific community worked to establish a home to archive and share the data emerging from these experiments. The Protein Data Bank (PDB) was established in 1971 with seven structures. The PDB provides a repository for scientists who generate the data, and an access point for researchers and students to find the information needed to drive additional studies. Today, the PDB contains and supports online access to ~117,000 biomacromolecules that help researchers understand aspects of biology, including medicine, agriculture, and biological energy. The ways in which the interrelationships among science, technology, and community have driven the evolution of the PDB resource for more than forty years will be discussed. The PDB archive is managed by the Worldwide Protein Data Bank, whose members are the RCSB PDB, PDBe, PDBj, and BMRB.

Curtis Langlotz SYNOPSIS:

The imaging report is an essential source of clinical imaging information. It documents critical information about the patient's health and provides a professional interpretation of the images. However, the vast majority of report information remains narrative, a major obstacle to the rapid extraction and re-use of discrete imaging data. Structured reporting facilitates linking of imaging observations to clinical and genomic data, and is increasingly being adopted by clinical imaging practices. However, most imaging reports are used only once by the clinician who ordered the imaging study and are rarely used again for research, clinical care, or analytics. This presentation will describe the likely future of the imaging report, including efforts underway to standardize radiology report information, and the use of machine learning and natural language processing techniques to extract the semantic elements of the radiology report. These novel technologies enable connections between images and the electronic health record, and represent a vital part of the future of medical research.

Martin Morgan SYNOPSIS:

Bioconductor is a widely used collection of R packages for the statistical analysis and comprehension of high-throughput genomic data. Bioconductor has strengths in sequence (RNA-seq, ChIP-seq, called variants, ...) and microarray (expression, methylation, copy number, ...) analysis, as well as significant facilities for flow cytometry, proteomics, and many other omics domains. The breadth of available facilities, coupled with principles of interoperability and reproducibility, makes Bioconductor an ideal platform for integrative approaches to cancer genomics. This presentation outlines technical aspects of recent and forthcoming facilities to enable integrative cancer genomic analysis in Bioconductor. We discuss our own work to enable routine integration of large-scale consortium (e.g., ENCODE, Ensembl) annotation into analysis workflows, development within Bioconductor of facilities to manage multiple-assay experiments, and approaches to scaling R's in-memory model to large-scale data sets. The presentation concludes with a brief overview of integrative approaches contributed to Bioconductor by our international contributors.

David Hanauer SYNOPSIS:


With the continued adoption of electronic health record (EHR) systems, healthcare centers are developing large repositories of unstructured clinical notes that were created as part of routine care. These data contain rich details that are often found nowhere else in the EHR, and can be valuable for research tasks ranging from cohort identification and eligibility determination to extracting phenotypic details in support of clinical and translational research. However, access to the data "locked" within these documents has historically been challenging for research teams, many of whom lack the expertise to utilize natural language processing tools. To address this problem, we developed the Electronic Medical Record Search Engine (EMERSE), an information retrieval tool designed with the end user in mind. Careful attention has been paid to usability and to ensuring that EMERSE has the functionality that most researchers need to access the data found within the clinical notes. EMERSE has been in use at the University of Michigan for over 10 years, continues to be enhanced, and has a wide and highly satisfied user base. One of the largest collective user groups has been our Cancer Center's Clinical Trials Office. EMERSE is available at no cost for academic use and we are actively seeking partners interested in adopting the tool. Additional information can be found at In this talk, we will provide a live demonstration of the tool, walking through its features and capabilities to show the kinds of tasks it can be used for.

