About Clinical and Biospecimen Data
Clinical and biospecimen data are represented in two file types: XML and a tab-delimited text file type called biotab. The two types present the same data in different ways, and both are open access. Either can be used to collect the barcodes of participants whose clinical data match the criteria of interest.
Each XML file contains data for a single participant; each biotab file contains data for multiple participants.
Either type of file can be used to extract and aggregate aliquot barcodes associated with participants' clinical data. Once relevant sample or aliquot barcodes and data have been parsed from the available XML or biotab file, samples can be aggregated according to clinical data elements of interest. The aggregated barcodes can then be mapped to the relevant data (see TCGA barcode).
For more information about biotab file types, see Biotab.
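The aggregation described above can be sketched in Python. This is a minimal illustration, not a DCC tool: the two biotab excerpts are invented, the column names (bcr_patient_barcode, bcr_aliquot_barcode, gender) are assumptions about the file layout, and the code assumes a single header row. It also assumes that the first 12 characters of an aliquot barcode are the participant barcode.

```python
import csv
import io
from collections import defaultdict

# Invented biotab excerpts; column names and values are assumptions.
CLINICAL_BIOTAB = (
    "bcr_patient_barcode\tgender\n"
    "TCGA-02-0001\tFEMALE\n"
    "TCGA-02-0003\tMALE\n"
)
ALIQUOT_BIOTAB = (
    "bcr_aliquot_barcode\n"
    "TCGA-02-0001-01C-01D-0182-01\n"
    "TCGA-02-0001-10A-01D-0182-01\n"
    "TCGA-02-0003-01A-01D-0182-01\n"
)


def read_biotab(text):
    """Parse tab-delimited text with one header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def aliquots_by_element(clinical_rows, aliquot_rows, element):
    """Group aliquot barcodes by the value of one clinical data element."""
    # Map each participant barcode to its value for the chosen element.
    value_of = {row["bcr_patient_barcode"]: row[element] for row in clinical_rows}
    groups = defaultdict(list)
    for row in aliquot_rows:
        aliquot = row["bcr_aliquot_barcode"]
        participant = aliquot[:12]  # assumed participant prefix, e.g. TCGA-02-0001
        if participant in value_of:
            groups[value_of[participant]].append(aliquot)
    return dict(groups)


groups = aliquots_by_element(
    read_biotab(CLINICAL_BIOTAB), read_biotab(ALIQUOT_BIOTAB), "gender"
)
```

Here groups maps each value of the chosen clinical element to the list of matching aliquot barcodes, which can then be used to look up the corresponding assay data.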
A clinical XML file contains clinical data collected for a single participant. Clinical XML filenames always contain the participant ID, and the XML content of a clinical file always contains an element shared:bcr_patient_barcode whose text content is the participant ID.
Clinical data is represented in an XML file as text content (the data itself) enclosed in XML tags. The XML tag name identifies the data element.
A number of tools are available for visualizing and parsing XML files, and most major scripting and programming languages (including Perl, Python, Ruby, and Java) have packages for efficiently parsing XML.
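As one possible illustration, the sketch below uses Python's standard xml.etree.ElementTree module to read the participant ID and other data elements from a clinical XML document. The sample document is invented: its namespace URI, root element, and the gender element are placeholders, and only the shared:bcr_patient_barcode element name comes from the description above. Because namespace URIs may differ between files, the code matches on the tag name without its namespace.

```python
import xml.etree.ElementTree as ET

# Invented sample; the namespace URI and structure are placeholders.
SAMPLE_XML = """
<root xmlns:shared="http://example.org/shared">
  <patient>
    <shared:bcr_patient_barcode>TCGA-02-0001</shared:bcr_patient_barcode>
    <shared:gender>FEMALE</shared:gender>
  </patient>
</root>
"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def clinical_elements(xml_text):
    """Return {tag name: text content} for every leaf element with text."""
    root = ET.fromstring(xml_text.strip())
    data = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        if text and len(elem) == 0:  # leaf element carrying data
            data[local_name(elem.tag)] = text
    return data


data = clinical_elements(SAMPLE_XML)
participant_id = data["bcr_patient_barcode"]
```

If the same tag name occurs more than once in a file, this sketch keeps only the last value; a real parser would need to track the element's position in the document.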
The DCC provides tools for searching and downloading clinical XML for desired studies. The TCGA Archive Search can be used to find clinical XML archives. Select the desired study in the Cancer Type box, choose "Complete Clinical Set" in the Data Type box, and set Center to "All". Select Find to open a list of clinical XML files for all participants for whom clinical data is available.
Acquiring clinical data from Tissue Source Sites is a labor-intensive process. Because the mandate of TCGA is to make data available to the community as soon as it is acquired, acquisition of participant clinical data can lag substantially behind acquisition of assay data, and that lag can become apparent in the DCC's data distribution. Updates on data distribution are available through the TCGA-DATA-L listserv.
The filenames of clinical XML files contain the participant IDs. Use the filenames to generate the list of participants to use when aggregating their related aliquot barcodes and assay data.
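Building the participant list from filenames might look like the following sketch. The example paths are hypothetical, and the regular expression assumes that a participant ID has the form TCGA-XX-XXXX (two, then four, uppercase letters or digits); adjust it if the filenames at hand follow a different pattern.

```python
import re
from pathlib import Path

# Assumed participant ID pattern, e.g. TCGA-02-0001.
PARTICIPANT_RE = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def participants_from_filenames(filenames):
    """Return the sorted, de-duplicated participant IDs found in filenames."""
    ids = set()
    for name in filenames:
        # Search only the final path component, not the directory names.
        match = PARTICIPANT_RE.search(Path(name).name)
        if match:
            ids.add(match.group(0))
    return sorted(ids)


# Hypothetical file listing for illustration.
example_files = [
    "archive/clinical.TCGA-02-0001.xml",
    "archive/clinical.TCGA-02-0003.xml",
    "archive/README.txt",
]
participants = participants_from_filenames(example_files)
```

Files whose names contain no matching ID, such as the README in this listing, are skipped.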