The TCGA DCC receives data submissions for many data types, ranging from genomic data (which spans multiple instrument platforms) to biospecimen and clinical data associated with TCGA participants. To provide this multi-sourced data to end users in stable and predictable formats, the DCC specifies general packaging requirements as well as specific file formats for certain data types. Upon submission, QCLive checks for compliance with these requirements. Data that fail validation are withheld from deployment, and the submitting institution is notified. These notifications include detailed error lists so that submitters can make corrections before resubmitting their data.
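The submit, validate, and notify cycle described above can be illustrated with a minimal Python sketch. It is not QCLive code: the names `ValidationReport`, `validate` and `notification`, and the two packaging checks, are hypothetical and exist only to show how collecting every error before reporting gives submitters a complete list to correct.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A check takes an archive name and returns an error message, or None if it passes.
Check = Callable[[str], Optional[str]]


@dataclass
class ValidationReport:
    """Collects every error found for one submitted archive."""
    archive: str
    errors: List[str] = field(default_factory=list)

    @property
    def deployable(self) -> bool:
        # An archive is deployable only if no check reported an error.
        return not self.errors


def check_extension(archive: str) -> Optional[str]:
    # Hypothetical packaging rule, used here for illustration only.
    if not archive.endswith(".tar.gz"):
        return f"{archive}: expected a .tar.gz archive"
    return None


def check_no_spaces(archive: str) -> Optional[str]:
    # Hypothetical naming rule, used here for illustration only.
    if " " in archive:
        return f"{archive}: archive name must not contain spaces"
    return None


def validate(archive: str, checks: List[Check]) -> ValidationReport:
    """Run all checks and keep every error rather than stopping at the first."""
    report = ValidationReport(archive)
    for check in checks:
        error = check(archive)
        if error is not None:
            report.errors.append(error)
    return report


def notification(report: ValidationReport) -> str:
    """Build the detailed error list sent back to the submitting institution."""
    lines = [
        f"Archive {report.archive} failed validation "
        f"with {len(report.errors)} error(s):"
    ]
    lines += [f"  - {error}" for error in report.errors]
    return "\n".join(lines)
```

In this sketch, a caller would deploy the archive when `report.deployable` is true and otherwise send `notification(report)` to the submitter.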
The primary goals of DCC data validation are to ensure that data can be:
- traceable to the originating participant
- deployed consistently and predictably
- loaded into the DCC relational database management system robustly
- cataloged correctly with key metadata such as date, revision version, submitting center, disease study, data type and data level.
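As an illustration of the cataloging goal, the following sketch checks that a record carries each of the metadata items listed above. The field names and the `missing_metadata` helper are assumptions made for this example and do not reflect the DCC's actual schema.

```python
from typing import Dict, List, Optional

# Hypothetical field names for the metadata items named in the list above.
REQUIRED_METADATA = (
    "date",
    "revision",
    "submitting_center",
    "disease_study",
    "data_type",
    "data_level",
)


def missing_metadata(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the required metadata fields that are absent or empty."""
    return [key for key in REQUIRED_METADATA if record.get(key) in (None, "")]
```

An archive whose record yields an empty list could be cataloged; any other result names the fields that still need to be supplied.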
Scientific validation of data values is not a primary goal, because scientific data quality control is best performed, and generally is performed, by subject matter experts at the submission centers themselves. In response to shifting project requirements and requests, however, some specific data types or files are subjected to more stringent analytical or ontological validation.
These guidelines are in keeping with the DCC mission, which is to coordinate and provide timely access to TCGA data. One consequence of these goals is that not every file undergoes specialized validation.
All archives and data files submitted to the DCC undergo the validation steps captured in the diagram below.
Rectangles represent validation steps and ovals represent processing steps. Hexagons represent data loading or transformations that take place post validation.
Validations that apply to all archives, regardless of the originating center type, are depicted as blue rectangles. File-specific (and hence Data Type- and Center Type-specific) validations are depicted as green (if GCC), red (if GSC) or yellow (if BCR) rectangles. There are currently no validations for GDAC data because no GDAC data have been submitted to the DCC via the standard process.
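The color legend can be summarized as a small lookup table. The sketch below is only a reading aid for the diagram; the names `CENTER_TYPE_COLORS`, `COMMON_COLOR` and `rectangle_color` are hypothetical.

```python
from typing import Optional

# Colors of file-specific validation rectangles, keyed by center type.
CENTER_TYPE_COLORS = {
    "GCC": "green",
    "GSC": "red",
    "BCR": "yellow",
}

# Color of validations that apply to all archives, whatever the center type.
COMMON_COLOR = "blue"


def rectangle_color(center_type: Optional[str] = None) -> Optional[str]:
    """Return the legend color for a validation step.

    Passing no center type means the validation applies to all archives.
    A center type with no file-specific validations (such as GDAC) gives None.
    """
    if center_type is None:
        return COMMON_COLOR
    return CENTER_TYPE_COLORS.get(center_type)
```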