
Extensible Markup Language (XML) is a text format that captures structure by nesting tags. In TCGA, each tag represents a biological entity, and XML is used to represent clinical and biospecimen information.

Each XML file must conform to its accompanying XML Schema Definition (XSD).

About Clinical and Biospecimen Data

Clinical and biospecimen data are represented in two file types: XML and a tab-delimited text file type called biotab. The two present the same data structure in different ways, and both are open access data. They can be used to collect the barcodes of participants who fit the clinical data types of interest.

Each XML file contains data for a single participant; each biotab file contains data for multiple participants.

Either type of file can be used to extract and aggregate aliquot barcodes associated with participants' clinical data. Once relevant sample or aliquot barcodes and data have been parsed from the available XML or biotab file, samples can be aggregated according to clinical data elements of interest. The aggregated barcodes can then be mapped to the relevant data (see TCGA barcode).
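
The aggregation step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the barcodes and clinical values are made up, the helper names `participant_of` and `group_aliquots` are hypothetical, and the code assumes that the first three dash-separated fields of an aliquot barcode identify the participant (see TCGA barcode for the authoritative layout).

```python
from collections import defaultdict


def participant_of(barcode):
    """Assumption: the first three dash-separated fields identify the participant."""
    return "-".join(barcode.split("-")[:3])


def group_aliquots(aliquots, clinical, element):
    """Group aliquot barcodes by the value of one clinical data element."""
    groups = defaultdict(list)
    for barcode in aliquots:
        record = clinical.get(participant_of(barcode))
        # Skip aliquots whose participant has no clinical record or no value.
        if record is not None and record.get(element) is not None:
            groups[record[element]].append(barcode)
    return dict(groups)


# Made-up clinical values, as they might look after parsing XML or biotab files.
clinical = {
    "TCGA-AA-0001": {"gender": "MALE"},
    "TCGA-AA-0002": {"gender": "FEMALE"},
}
# Made-up aliquot barcodes.
aliquots = [
    "TCGA-AA-0001-01A-01D-0001-01",
    "TCGA-AA-0002-01A-01R-0002-07",
    "TCGA-AA-0001-10A-01D-0001-01",
]

by_gender = group_aliquots(aliquots, clinical, "gender")
print(by_gender)
```

Each group of barcodes can then be used to look up the corresponding assay data.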


For more information about biotab file types, see Biotab.

Obtaining Relevant XML Files

A clinical XML file contains clinical data collected for a single participant. Clinical XML filenames always contain the participant ID, and the XML content of a clinical file always contains an element shared:bcr_patient_barcode whose text content is the participant ID.
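
As a minimal sketch, the participant ID could be read from the file content and compared with the filename like this. The sample document, its namespace URIs, the barcode, and the filename are all placeholders invented for illustration, and `participant_id` is a hypothetical helper. Matching on the local tag name avoids depending on the real namespace URIs, which are not given here.

```python
import xml.etree.ElementTree as ET

# Placeholder document: the namespace URIs and the barcode are invented.
SAMPLE = """\
<coad:patient xmlns:coad="urn:example:coad" xmlns:shared="urn:example:shared">
  <shared:bcr_patient_barcode>TCGA-AA-0001</shared:bcr_patient_barcode>
  <shared:gender>MALE</shared:gender>
</coad:patient>
"""


def participant_id(xml_text):
    """Return the text of the first bcr_patient_barcode element, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # ElementTree stores tags as '{namespace-uri}name'; keep only the name.
        if elem.tag.rsplit("}", 1)[-1] == "bcr_patient_barcode":
            return (elem.text or "").strip()
    return None


pid = participant_id(SAMPLE)
filename = "clinical.TCGA-AA-0001.xml"  # hypothetical filename
print(pid, pid in filename)
```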

Clinical data is represented in an XML file as text content (the data itself) surrounded by XML tags. The XML tag name identifies the data.

    <coad:patient>
        <shared:tumor_tissue_site cde="2735776" owner="TSS" procurement_status="Completed" tier="1" xsd_ver="1.8">COLON</shared:tumor_tissue_site>
        <coad:histological_type cde="3081934" owner="TSS" procurement_status="Completed" tier="1" xsd_ver="1.9">Colon Adenocarcinoma</coad:histological_type>
        <shared:prior_diagnosis cde="61396" owner="TSS" procurement_status="Completed" tier="1" xsd_ver="2.2">NO</shared:prior_diagnosis>
        <shared:gender cde="2200604" owner="TSS" procurement_status="Completed" tier="1" xsd_ver="1.8">MALE</shared:gender>

        <shared:vital_status cde="2939553" owner="TSS" procurement_status="Not Available" tier="1" xsd_ver="1.8"/>
        <shared:days_to_birth cde="" owner="TSS" procurement_status="Not Available" tier="1" xsd_ver="1.16"/>
        <shared:days_to_last_known_alive cde="" owner="TSS" procurement_status="Not Available" tier="1" xsd_ver="1.16"/>
        <shared:days_to_death cde="" owner="TSS" procurement_status="Not Applicable" tier="1" xsd_ver="1.16"/>
        <shared:days_to_last_followup cde="" owner="TSS" procurement_status="Not Available" tier="1" xsd_ver="1.16"/>
        ...
    </coad:patient>
    ...

A number of tools are available for visualizing and parsing XML files, and most major scripting and programming languages (including Perl, Python, Ruby, and Java) have packages for efficiently parsing XML.
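
As one illustration, Python's standard `xml.etree.ElementTree` module can read a fragment like the one above. The sketch below is not an official TCGA tool: the namespace URIs in the sample are placeholders (the excerpt above does not show the real declarations), the helper names are hypothetical, and the code assumes the patient element is the root of the parsed text. It collects each element's text content together with its `procurement_status` attribute, so that empty elements such as `vital_status` remain distinguishable from completed ones.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the namespace URIs are placeholders, not the real TCGA ones.
SAMPLE = """\
<coad:patient xmlns:coad="urn:example:coad" xmlns:shared="urn:example:shared">
  <shared:tumor_tissue_site cde="2735776" procurement_status="Completed">COLON</shared:tumor_tissue_site>
  <shared:gender cde="2200604" procurement_status="Completed">MALE</shared:gender>
  <shared:vital_status cde="2939553" procurement_status="Not Available"/>
</coad:patient>
"""


def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def clinical_values(xml_text):
    """Map each child element name to (text or None, procurement_status)."""
    root = ET.fromstring(xml_text)
    values = {}
    for elem in root:  # direct children of the patient element only
        text = (elem.text or "").strip() or None
        values[local_name(elem.tag)] = (text, elem.get("procurement_status"))
    return values


values = clinical_values(SAMPLE)
print(values["gender"])
print(values["vital_status"])
```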

The DCC provides tools for searching and downloading clinical XML for desired studies. The TCGA Archive Search can be used to find clinical XML archives. Select the desired study in the Cancer Type box, and choose "Complete Clinical Set" in the Data Type box; Center should be set to "All". Select Find to open a list of clinical XML for all participants for which clinical data is available.

Acquiring clinical data from Tissue Source Sites is a labor-intensive, manual process. Because the mandate of TCGA is to make data available to the community as soon as it is acquired, acquisition of participant clinical data can lag substantially behind acquisition of assay data, and that lag can become apparent in the DCC's data distribution. Updates on data distribution are available through the TCGA-DATA-L listserv.

The filenames of clinical XML files contain the participant IDs. Use the filenames to generate the list of participants to use when aggregating their related aliquot barcodes and assay data.
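
A rough sketch of building the participant list from filenames is shown below. The filenames are invented, `participants_from_filenames` is a hypothetical helper, and the regular expression assumes that participant IDs have the form `TCGA-XX-XXXX`; adjust the pattern if the files at hand differ.

```python
import re
from pathlib import Path

# Assumption: participant IDs look like "TCGA-XX-XXXX".
PARTICIPANT_RE = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}")


def participants_from_filenames(filenames):
    """Return the sorted, de-duplicated participant IDs found in the filenames."""
    ids = set()
    for name in filenames:
        # Search only the final path component, not the directory names.
        match = PARTICIPANT_RE.search(Path(name).name)
        if match:
            ids.add(match.group(0))
    return sorted(ids)


# Invented filenames for illustration.
files = [
    "clinical.TCGA-AA-0001.xml",
    "clinical.TCGA-AA-0002.xml",
    "README.txt",
]
participants = participants_from_filenames(files)
print(participants)
```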
